
Can No-reference features help in Full-reference image quality estimation?

Saikat Dutta
IIT Madras
[email protected]
   Sourya Dipta Das
Jadavpur University
[email protected]
   Nisarg A. Shah
IIT Jodhpur
[email protected]
Abstract

Development of perceptual image quality assessment (IQA) metrics has been of significant interest to the computer vision community. The aim of these metrics is to model the quality of an image as perceived by humans. Recent works in Full-reference IQA research perform pixelwise comparison between deep features corresponding to the query and reference images for quality prediction. However, pixelwise feature comparison may not be meaningful if the distortion present in the query image is severe. In this context, we explore the utilization of no-reference features in the Full-reference IQA task. Our model consists of both full-reference and no-reference branches. The Full-reference branches use both the distorted and reference images, whereas the No-reference branch uses only the distorted image. Our experiments show that the use of no-reference features boosts the performance of image quality assessment. Our model achieves higher SRCC and KRCC scores than a number of state-of-the-art algorithms on the KADID-10K and PIPAL datasets.

1 Introduction

The use of image processing to improve the quality of content to an acceptable level for human viewers has been of substantial interest to the computer vision community. A significant step towards this goal is to accurately measure the perceptual quality of the content, which has been found to be critical in several computer vision applications [4, 22]. A primary weakness of traditional deep learning-based convolutional neural networks is that they usually generate overly smooth images instead of highly textured ones, because proper metrics for training these networks and quantifying perceptual quality are unavailable. To an extent, Generative Adversarial Networks (GANs) [8] address this problem by learning distributions with pointed peaks and adopting an adversarial training loss for image restoration tasks. Using this, they can generate sharp and visually appealing images compared to models trained without adversarial loss. Nonetheless, such adversarially trained models often obtain lower scores than those trained without adversarial loss when evaluated on traditional, well-known metrics such as PSNR and SSIM [27]. These metrics are inferior, specifically for assessing textures and fine details in the generated image [17]. Since the ultimate goal of image enhancement models is to render visually pleasing images for humans and achieve a high Mean Opinion Score (MOS), developing a robust metric for IQA is essential. Neural perceptual image quality metrics can also be utilized as a loss function to train deep networks for image or video restoration.

Perceptual IQA methods are of two kinds: Full-reference methods, where a distorted query image is compared against a reference image, and No-reference methods, where the quality of a query image is evaluated without any reference image. The common approach in the Full-reference IQA (FR-IQA) task is to extract features from both the reference and query images using a network and predict the perceptual quality score based on an interaction ($L_2$ distance, dot product, etc.) between the reference and query features. However, the contribution of features extracted only from the query image to the FR-IQA task is under-explored. In this paper, we propose a multi-branch model for the FR-IQA problem which utilizes both Full-reference and No-reference feature extraction. In one Full-reference branch, we compute the difference of ImageNet features extracted from the query and reference images, whereas learnt features are compared in the other Full-reference branch. The No-reference branch uses only the query image to extract features. Features from these three branches are concatenated and fed to fully connected layers to predict the quality score. Our model surpasses state-of-the-art FR-IQA models on two benchmark datasets.

2 Related works

Image Quality Assessment (IQA) methods are used to assess the quality of pictures that may have been deteriorated during processing such as generation, compression, denoising, and style transfer. IQA algorithms may be classified into no-reference and full-reference approaches based on their settings. No-reference methods are designed to assess image quality without the need for a reference image [19, 34]. On the other hand, image quality assessment in the full-reference setting takes both the distorted and the reference image into account.

SSIM [27], MS-SSIM [28], PSNR, and other full-reference approaches are extensively used. They inspired the development of FSIM [32], SR-SIM [31], and GMSD [30]. These hand-crafted approaches compare the feature difference between the distorted image and the reference image to determine image quality. Deep learning-based full-reference algorithms [21, 2] have recently been shown to outperform hand-crafted approaches.

Bosse et al. [2] implemented an architecture within a unified framework which allows joint learning of local quality and local weights; in other words, the relative importance of local quality to the final quality estimate is learnt. Their proposed architecture can be used in both the No-reference (NR) and Full-reference (FR) IQA settings with minor modifications. Zhang et al. [33] showed that deep features extracted from the internal activations of networks trained for high-level classification tasks represent human perceptual similarity remarkably well, outperforming widely accepted metrics like SSIM, PSNR, and FSIM. They improved the performance of their network by calibrating feature outputs from a pretrained network on their proposed BAPPS dataset. Ding et al. [5] proposed the first full-reference IQA model with texture resampling tolerance. Their method blends spatial average correlations as texture similarity with feature map correlations as structure similarity.

Prashnani et al. [21] made a large-scale dataset annotated with the probability that humans will prefer one image over another as human perceptual error. Then, using a pairwise-learning framework, they proposed a new metric, PieAPP, estimated by a deep-learning model to predict which distorted image would be preferred over the other; it correlates well with human opinion. Gu et al. [9] contributed the Perceptual Image Processing ALgorithms (PIPAL) dataset, a large-scale IQA dataset. This dataset, in particular, contains the results of GAN-based image restoration methods, which were not included in earlier datasets. They also proposed the Space Warping Difference Network, which incorporates $L_2$ pooling layers and Space Warping Difference layers, to improve an IQA network's performance on GAN-based distortion by explicitly addressing spatial misalignment.

Cheon et al. [3] adapted Vision Transformers [7] for perceptual IQA. They used a pretrained Inception-ResNet-v2 network [25] as the feature extraction backbone and a transformer encoder-decoder architecture to obtain the quality score prediction. Shi et al. [23] proposed the Region-Adaptive Deformable Network, which uses reference-oriented deformable convolution to improve the performance of the network on GAN-based distortion by adaptively accounting for spatial misalignment. Their patch-level attention module enhances the interaction between distinct patch regions that were previously processed separately.

Figure 1: Overview of our FR-IQA model. Both the query and reference images are fed to the FRP and FRNP branches. The NR branch takes only the query image as input. Features from all the branches are concatenated and passed to three fully connected layers to predict the final quality score.

Guo et al. [10] sample random patches from both the query and reference images and extract features at different scales using a feature extraction module. A quality score per scale is generated with the help of Feature Fusion and Score Regression modules, and these scores are averaged to obtain an image-level quality score. Ayyoubzadeh et al. [1] use a Siamese-difference network equipped with spatial and channel attention. They also develop a Surrogate Ranking loss to improve the Spearman rank correlation score. Hammou et al. [11] compute the difference of features extracted from different layers of a pretrained VGG16 network [24]. These features are fed to an ensemble consisting of XGBoost, LightGBM, and CatBoost to predict the quality score.

3 Proposed method

Given a reference image $I_r$ and a distorted query image $I_q$, our goal is to predict a quality score $p$ which correlates with the perceived quality of $I_q$. Our model consists of three parallel branches: (a) a Full-reference pretrained (FRP) branch, (b) a Full-reference non-pretrained (FRNP) branch, and (c) a No-reference (NR) branch. For the Full-reference branches (FRP and FRNP), both the query and reference images are fed as input, whereas the No-reference branch takes only the query image as input. Each branch contains a convolutional neural network based encoder. In the FRP branch, we use the encoder from a classifier trained on ImageNet, since these features are known to correlate well with perceptual quality [15, 13, 33]. The weights of the FRP encoder are kept fixed throughout training. We do not use pretrained encoders in the FRNP and NR branches, to enable learning of discriminative features from the training data.
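
As a rough illustration, the three encoders could be instantiated in PyTorch as sketched below; the `make_encoder` helper, the torchvision weights API, and the choice of a VGG16 trunk here are our assumptions rather than the authors' released code.

```python
import torch.nn as nn
import torchvision.models as models

def make_encoder(pretrained: bool) -> nn.Module:
    """Convolutional trunk used as a branch encoder. VGG16 is shown here;
    the paper also experiments with ResNet50 and Inception-v3 backbones."""
    weights = models.VGG16_Weights.IMAGENET1K_V1 if pretrained else None
    return models.vgg16(weights=weights).features

# FRP branch: ImageNet-pretrained encoder, kept frozen throughout training.
enc_frp = make_encoder(pretrained=True)
for p in enc_frp.parameters():
    p.requires_grad = False
enc_frp.eval()

# FRNP and NR branches: same architecture, randomly initialised and trainable.
enc_frnp = make_encoder(pretrained=False)
enc_nr = make_encoder(pretrained=False)
```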

Full-reference branches focus on extracting features from both the query and reference images and computing the pixel-wise difference between query and reference features. However, when the distortion is severe, computing the pixel-wise difference, even in feature space, may not be optimal due to spatial misalignment. Hence, we use a No-reference branch that focuses only on features related to the distortion present in the query image.

Let's denote the encoders of the FRP branch, FRNP branch and NR branch as $E^{FRP}$, $E^{FRNP}$ and $E^{NR}$ respectively. We describe details about the Full-reference and No-reference branches in the following.

Full-reference branches: In a full-reference branch $b$, we extract multi-scale features from the corresponding encoder $E^{b}$ for both the query and reference images. We obtain features from four different scales, $\phi^{b}_{s}(I_{t})$, where $s \in \{1,2,3,4\}$ and $I_{t} \in \{I_{q}, I_{r}\}$. The spatial resolution of features at scale $s$ is $(h/2^{s} \times w/2^{s})$, where $(h, w)$ is the resolution of the input images. Then we compute difference features $d^{b}_{s}$ from each scale and concatenate them to obtain $d^{b}$. Hence, $d^{FRP}$ and $d^{FRNP}$ are given by,

$d^{FRP}_{s} = GAP(abs(\phi^{FRP}_{s}(I_{q}) - \phi^{FRP}_{s}(I_{r})))$ (1)
$d^{FRP} = d^{FRP}_{1} \oplus d^{FRP}_{2} \oplus d^{FRP}_{3} \oplus d^{FRP}_{4}$ (2)
$d^{FRNP}_{s} = GAP(abs(\phi^{FRNP}_{s}(I_{q}) - \phi^{FRNP}_{s}(I_{r})))$ (3)
$d^{FRNP} = d^{FRNP}_{1} \oplus d^{FRNP}_{2} \oplus d^{FRNP}_{3} \oplus d^{FRNP}_{4}$ (4)

where $abs(\cdot)$ is the absolute value, $GAP(\cdot)$ is the Global Average Pooling layer, and $\oplus$ stands for feature concatenation.
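
A sketch of how Equations (1)-(4) could be realised for a VGG16 encoder is given below; the stage boundaries inside `vgg16_stages` and the helper names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def vgg16_stages(trunk: torch.nn.Sequential):
    """Split a VGG16 trunk into four stages whose outputs have resolutions
    (h/2, w/2), (h/4, w/4), (h/8, w/8), (h/16, w/16), i.e. scales s = 1..4.
    The exact cut points are our assumption."""
    cuts = [(0, 5), (5, 10), (10, 17), (17, 24)]  # each slice ends with a max-pool
    return [trunk[a:b] for a, b in cuts]

def multiscale_features(stages, x):
    feats = []
    for stage in stages:
        x = stage(x)
        feats.append(x)           # phi_s(x) for s = 1..4
    return feats

def difference_features(stages, img_q, img_r):
    """Eqs. (1)-(4): GAP of |phi_s(I_q) - phi_s(I_r)| per scale, concatenated."""
    feats_q = multiscale_features(stages, img_q)
    feats_r = multiscale_features(stages, img_r)
    d = [F.adaptive_avg_pool2d((q - r).abs(), 1).flatten(1)
         for q, r in zip(feats_q, feats_r)]
    return torch.cat(d, dim=1)    # (batch, 64 + 128 + 256 + 512) for VGG16

stages = vgg16_stages(models.vgg16(weights=None).features)
img_q, img_r = torch.rand(1, 3, 288, 288), torch.rand(1, 3, 288, 288)
print(difference_features(stages, img_q, img_r).shape)  # torch.Size([1, 960])
```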

No-reference branch: Since the reference image is not used as an input to the No-reference (NR) branch, we obtain no-reference features from this branch. Similar to the Full-reference branches, multi-scale features are extracted, followed by Global Average Pooling and feature concatenation.

$f^{NR}_{s} = GAP(\phi^{NR}_{s}(I_{q}))$ (5)
$f^{NR} = f^{NR}_{1} \oplus f^{NR}_{2} \oplus f^{NR}_{3} \oplus f^{NR}_{4}$ (6)

Finally, the difference features from the full-reference branches and the no-reference features are concatenated and passed to three fully connected layers to predict the quality score. An overview of our model is shown in Figure 1.
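
A minimal sketch of the fusion and regression head follows; the hidden layer sizes and the 960-dimensional per-branch features (VGG16) are illustrative assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Concatenate d^FRP, d^FRNP and f^NR and regress the quality score with
    three fully connected layers. Hidden sizes 512 and 128 are assumptions."""
    def __init__(self, in_dim: int = 3 * 960, hidden=(512, 128)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden[0]), nn.ReLU(inplace=True),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(inplace=True),
            nn.Linear(hidden[1], 1),
        )

    def forward(self, d_frp, d_frnp, f_nr):
        fused = torch.cat([d_frp, d_frnp, f_nr], dim=1)
        return self.mlp(fused).squeeze(1)

head = ScoreHead()
score = head(torch.rand(2, 960), torch.rand(2, 960), torch.rand(2, 960))
print(score.shape)  # torch.Size([2])
```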

4 Experiments

4.1 Implementation and training details

We implement our models with the PyTorch [20] deep learning framework. We use the Adam optimizer [16] with an initial learning rate of $10^{-4}$ and a batch size of 8. We gradually decrease the learning rate to $10^{-6}$. Horizontal and vertical flipping are used to augment the training set. Mean Squared Error (MSE) loss is used as the training objective. We use a machine with one NVIDIA 1080Ti GPU for our experiments.
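
The sketch below shows one way this training setup could look in PyTorch; the number of epochs, the exact learning-rate schedule, and the model and data-loader interfaces (`model(img_q, img_r)`, batches of query, reference, MOS) are our assumptions.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=30, device="cuda"):
    """Minimal training loop for Section 4.1: Adam at 1e-4 decayed towards 1e-6,
    MSE loss. Flips are assumed to be applied jointly to (I_q, I_r) inside the
    dataset/loader."""
    model.to(device)
    params = [p for p in model.parameters() if p.requires_grad]  # FRP encoder stays frozen
    optimizer = torch.optim.Adam(params, lr=1e-4)
    # Decay so the learning rate reaches roughly 1e-6 by the final epoch;
    # the exact schedule is not specified in the paper.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(
        optimizer, gamma=(1e-6 / 1e-4) ** (1.0 / epochs))
    criterion = nn.MSELoss()

    for _ in range(epochs):
        for img_q, img_r, mos in train_loader:  # batch size 8 in the paper
            img_q, img_r, mos = img_q.to(device), img_r.to(device), mos.to(device)
            loss = criterion(model(img_q, img_r), mos)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```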

4.2 Dataset Description

We use KADID-10K [18] and PIPAL [14] datasets in our experiments.

KADID-10K: The KADID-10K dataset has 81 high-quality reference images of resolution $512 \times 384$. There are 10,125 query images spanning 25 traditional distortion types in this dataset. The Mean Opinion Score (MOS) range of this dataset is 1 to 5, where 1 stands for the poorest quality and 5 for the highest quality. We perform an 80%-20% split of the KADID-10K dataset for training and evaluation, respectively.
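
A minimal sketch of such an 80%/20% split is shown below; since the text does not state whether the split is performed over distorted images or over reference images, we split by reference image here to keep train and test content disjoint, which is an assumption.

```python
import random

def split_kadid(reference_ids, train_frac=0.8, seed=0):
    """80%/20% split of KADID-10K by reference image (our assumption; the paper
    does not specify how the split is drawn)."""
    ids = sorted(reference_ids)
    random.Random(seed).shuffle(ids)
    cut = int(train_frac * len(ids))
    return set(ids[:cut]), set(ids[cut:])

# 81 reference images; the identifiers below are placeholders.
train_refs, test_refs = split_kadid([f"I{i:02d}" for i in range(1, 82)])
print(len(train_refs), len(test_refs))  # 64 17
```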

PIPAL: The PIPAL dataset consists of 250 reference images of size $288 \times 288$ and 29K distorted images covering 40 distortion types in total. This dataset contains not only traditional distortions but also distortions produced by different image restoration algorithms, including GANs. MOS scores in this dataset lie roughly between 900 and 1850, where a higher score denotes better perceptual quality. We use the publicly available training split of the PIPAL dataset for evaluation.

4.3 Result

We have compared our approach against four state-of-the-art Full-reference IQA methods: WaDIQaM [2], LPIPS-Alex, LPIPS-VGG [33], and DISTS [6]. Absolute values of the Spearman Rank Correlation Coefficient (SRCC) and Kendall Rank Correlation Coefficient (KRCC) are reported in Table 1 for both datasets. The quantitative results demonstrate that our approach performs better than the other state-of-the-art methods.
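
For reference, the SRCC and KRCC values reported here can be computed with SciPy as sketched below; taking absolute values follows the evaluation protocol stated above, and the helper name is ours.

```python
import numpy as np
from scipy import stats

def srcc_krcc(predicted, mos):
    """Absolute SRCC and KRCC between predicted scores and MOS, as in Table 1."""
    predicted, mos = np.asarray(predicted), np.asarray(mos)
    srcc, _ = stats.spearmanr(predicted, mos)
    krcc, _ = stats.kendalltau(predicted, mos)
    return abs(srcc), abs(krcc)

print(srcc_krcc([1.2, 3.4, 2.8, 4.9], [1.0, 3.5, 3.0, 5.0]))  # (1.0, 1.0)
```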

Figure 2: Effect of different branches on the PIPAL dataset. Numbers within brackets denote quality-wise ranks for the corresponding method in the row. Note that since the models are trained on the KADID-10K dataset, the predicted quality scores lie in the range 1-5.
Table 1: Quantitative comparison with state-of-the-art methods.
Method        KADID-10K           PIPAL
              SRCC      KRCC      SRCC      KRCC
WaDIQaM       0.8909    0.7079    0.6250    0.4459
LPIPS-Alex    0.8137    0.6222    0.5870    0.4112
LPIPS-VGG     0.7138    0.5269    0.5735    0.4048
DISTS         0.7966    0.6094    0.5785    0.4065
Ours          0.9536    0.8128    0.6580    0.4738

4.4 Ablation Study

Effect of different branches: We have trained our model after removing the NR and FRNP branches to understand the importance of these branches. For this experiment, we have used a VGG16 backbone as the feature extractor in the corresponding branches. From Table 2, we can infer that the FRP branch along with the FRNP branch performs better than the FRP branch alone, since the FRNP branch learns discriminative features based on the different distortions present in the training data. We achieve the best performance among all configurations when all three branches are used together. In Figure 2, we show qualitative results for different configurations on distorted images of the PIPAL dataset. We can see that predictions from the full model (FRP+FRNP+NR) preserve the original quality ranking, unlike the other configurations. This shows that the no-reference features learnt in the NR branch aid the FR-IQA task.

Table 2: Quantitative results for different model configurations (VGG16 backbone).
FRP branch   FRNP branch   NR branch   KADID-10K           PIPAL
                                       SRCC      KRCC      SRCC      KRCC
✓            –             –           0.9200    0.7579    0.5650    0.3971
✓            ✓             –           0.9315    0.7765    0.6215    0.4422
✓            ✓             ✓           0.9436    0.7952    0.6460    0.4617

Choice of backbone: In our experiments, we have used three different backbones in the FRP, FRNP, and NR branches: VGG16 [24], ResNet50 [12], and Inception-v3 [26]. The quantitative results are summarized in Table 3. Our model performs best on the KADID-10K dataset when Inception-v3 is used as the feature backbone, whereas we achieve the best performance on the PIPAL dataset when ResNet50 is used as the backbone. We have chosen the Inception-v3 backbone for our final model.

Table 3: Quantitative results for different feature extraction backbones.
Backbone       KADID-10K           PIPAL
               SRCC      KRCC      SRCC      KRCC
VGG16          0.9436    0.7952    0.6460    0.4617
ResNet50       0.9493    0.8068    0.6613    0.4761
Inception-v3   0.9536    0.8128    0.6580    0.4738

5 Conclusion

In this paper, we have proposed a multi-branch network for image quality assessment in the full-reference setting. The Full-reference branches compute features from both the query and reference images, whereas the No-reference branch extracts features only from the query image. The Full-reference branches utilize both ImageNet-pretrained features and learnt features. Our experiments show that the addition of the no-reference branch indeed improves FR-IQA performance. The proposed model outperforms state-of-the-art algorithms on two benchmark datasets. In the future, spatial and channel attention [29] can be incorporated to reweigh features in the different branches and improve quality score predictions.

References

  • [1] Seyed Mehdi Ayyoubzadeh and Ali Royat. (asna) an attention-based siamese-difference neural network with surrogate ranking loss function for perceptual image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 388–397, 2021.
  • [2] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on image processing, 27(1):206–219, 2017.
  • [3] Manri Cheon, Sung-Jun Yoon, Byungyeon Kang, and Junwoo Lee. Perceptual image quality assessment with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 433–442, 2021.
  • [4] Shyamprasad Chikkerur, Vijay Sundaram, Martin Reisslein, and Lina J Karam. Objective video quality assessment methods: A classification, review, and performance comparison. IEEE transactions on broadcasting, 57(2):165–182, 2011.
  • [5] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, PP, 2020.
  • [6] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • [9] Jinjin Gu, Haoming Cai, Haoyu Chen, Xiaoxing Ye, Jimmy Ren, and Chao Dong. Image quality assessment for perceptual image restoration: A new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002, 2020.
  • [10] Haiyang Guo, Yi Bin, Yuqing Hou, Qing Zhang, and Hengliang Luo. Iqma network: Image quality multi-scale assessment network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 443–452, 2021.
  • [11] Dounia Hammou, Sid Ahmed Fezza, and Wassim Hamidouche. Egb: Image quality assessment based on ensemble of gradient boosting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 541–549, 2021.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] Xin Hong, Pengfei Xiong, Renhe Ji, and Haoqiang Fan. Deep fusion network for image completion. In Proceedings of the 27th ACM international conference on multimedia, pages 2033–2042, 2019.
  • [14] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S Ren, and Dong Chao. Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In European Conference on Computer Vision, pages 633–651. Springer, 2020.
  • [15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • [18] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. Kadid-10k: A large-scale artificially distorted iqa database. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3. IEEE, 2019.
  • [19] Xialei Liu, Joost Van De Weijer, and Andrew D Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. In Proceedings of the IEEE International Conference on Computer Vision, pages 1040–1049, 2017.
  • [20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [21] Ekta Prashnani, Hong Cai, Yasamin Mostofi, and Pradeep Sen. Pieapp: Perceptual image-error assessment through pairwise preference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1808–1817, 2018.
  • [22] Hamid R Sheikh, Muhammad F Sabir, and Alan C Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing, 15(11):3440–3451, 2006.
  • [23] Shuwei Shi, Qingyan Bai, Mingdeng Cao, Weihao Xia, Jiahao Wang, Yifan Chen, and Yujiu Yang. Region-adaptive deformable network for image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 324–333, 2021.
  • [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [25] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, 2017.
  • [26] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [27] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [28] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  • [29] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [30] Wufeng Xue, Lei Zhang, Xuanqin Mou, and Alan C Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE transactions on image processing, 23(2):684–695, 2013.
  • [31] Lin Zhang and Hongyu Li. Sr-sim: A fast and high performance iqa index based on spectral residual. In 2012 19th IEEE international conference on image processing, pages 1473–1476. IEEE, 2012.
  • [32] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE transactions on Image Processing, 20(8):2378–2386, 2011.
  • [33] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [34] Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. Metaiqa: Deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14143–14152, 2020.