
HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution

Minyi Zhao, Yi Xu, Bingjia Li, Jie Wang, Jihong Guan, and Shuigeng Zhou. Manuscript received July 25, 2023. Minyi Zhao, Yi Xu, Bingjia Li and Shuigeng Zhou (Corresponding author) are with the Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Jiangwan Campus, Fudan University, 2005 Songhu Road, Shanghai, 200438, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]. Jie Wang is with ByteDance Inc, Beijing 100098, China. E-mail: [email protected]. Jihong Guan is with the Department of Computer Science and Technology, Tongji University, 4800 Caoan Road, Shanghai, 201804, China. E-mail: [email protected].
Abstract

Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to do STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN), that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch aiming at generating high-quality (HQ) text images based on the HR images to provide more accurate supervision for the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradations, and exploit the feedback from a recognizer and text-level annotations as weak supervision signals to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the qualities of HQ images, which are used to suppress erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performances.

Index Terms:
Scene text image super-resolution, scene text recognition, super-resolution, resolution enhancement
Figure 1: Overview of existing STISR approaches and our method, and examples illustrating the quality problem of HR images. (a) The framework of existing STISR methods; (b) The HiREN framework; (c) Some examples of low-quality HR images and their enhanced results (HQ) by our method, as well as the recognized results. For each case, the 1st row shows HR and HQ images, the 2nd row presents the normalized HR and HQ images to highlight their visual differences, and the 3rd row gives the recognized characters: red indicates incorrectly recognized, and black means correctly recognized.

I Introduction

Scene text recognition (STR) [1, 2], which aims at recognizing texts from scene images, has wide applications in scene-text-based image understanding (e.g. auto-driving [3], TextVQA [4], Doc-VQA [5], and ViteVQA [6]). Although STR has made great progress with the rapid development of deep learning in recent years, the performance of text recognition from low-resolution (LR) text images is still unsatisfactory [7]. Therefore, scene text image super-resolution (STISR) [8, 9, 7] is gaining popularity as a pre-processing technique to recover the missing details in LR images for boosting text recognition performance as well as the visual quality of the scene texts.

As shown in Fig. 1(a), recent STISR works usually try to directly capture pixel-level (via $L_1$ or $L_2$ loss) or text-specific information from high-resolution (HR) text images to supervise the training of STISR models. For instance, Gradient profile loss [7] calculates the gradient fields of HR images as ground truth for sharpening the boundaries of the super-resolution (SR) images. PCAN [10] is proposed to learn sequence-dependent features and high-frequency information of the HR images to better reconstruct SR text images. STT [8] exploits character-level attention maps from HR images to assist the recovery. [11] and TG [9] extract stroke-level information from HR images through specific networks to provide more fine-grained supervision information. [12, 13, 14] additionally introduce external modules to extract various text-specific clues to facilitate the recovery and use the supervision from HR images to finetune their modules.

Although various techniques that extract information from the HR images have been proposed to improve the recognition accuracy, they all assume that the HR images are completely trustworthy, which is actually not true, due to the uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing the HR images. As shown in Fig. 1(c), the HR images may suffer from blurring (the 1st and 2nd cases) and low contrast (the 3rd case), which unavoidably impacts the performance of STISR. In the worst case, these quality issues may cause the failure of recognition on HR images and lead to wrong supervision information. Worse still, the HR quality problem in the real world is far from negligible, as the recognition accuracy on HR images can be as low as 72.4% (see Tab. II).

Considering the fact that improving the photographing of LR/HR images and eliminating environmental impacts are extremely expensive (if not impossible) in the wild, and applying huge models for extracting more accurate information is also time-consuming and costly, in this paper we propose a novel solution to advance STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to perform STISR. To this end, we develop a new, general and easy-to-use STISR framework called High-Resolution ENhancement (HiREN) to improve STISR by providing more accurate supervision. In particular, as shown in Fig. 1(b), besides the typical LR recovery branch, HiREN additionally introduces an HR enhancement branch that aims at improving the quality of HR images and a quality estimation (QE) module to conduct quality-aware supervision. Here, the resulting high-quality (HQ) images, instead of the HR images as in existing works, are used to supervise the LR recovery branch. Since the degradation from HQ to HR is unknown and there is no explicit supervision for HR enhancement, existing STISR approaches cannot solve the task of HR enhancement. To tackle these problems, on the one hand, we introduce a degradation kernel predictor to generate the degradation kernel and then use this kernel as a clue to enhance variously degraded HR images. On the other hand, we exploit the feedback of a scene text recognizer and text-level annotations as weak supervision signals to train the HR enhancement branch. What is more, to suppress erroneous supervision information, a quality estimation (QE) module is proposed to evaluate the quality of the HQ images through the normalized Levenshtein similarity [15] between the recognized text and the ground truth, and then use this quality estimation to weight the loss of each HQ image.

This design offers our method four advantages:

  • General. Our framework can work with most existing STISR approaches in a plug-and-play manner.

  • Easy-to-use. After training the HR enhancement branch, our method can be plugged online to the training of existing techniques easily.

  • Efficient. HiREN does not introduce additional cost during inference. What is more, HiREN can also be deployed offline by caching all the enhanced HR images. This offline deployment does not introduce any additional training cost.

  • High-performance. Our method can significantly boost the performances of existing methods.

Contributions of this paper are summarized as follows:

  • We propose a novel approach for STISR. To the best of our knowledge, this is the first work to consider and exploit the quality of HR images in STISR. That is, different from existing approaches that extract various text-specific information, our work pioneers the exploration of the quality issue of HR images.

  • We develop a general, efficient and easy-to-use High-Resolution ENhancement (HiREN) framework to boost STISR by improving the supervision information from the HR images.

  • We conduct extensive experiments on TextZoom, which show that HiREN is compatible with most existing STISR methods and can significantly lift their performances.

The rest of this paper is organized as follows: Section II surveys related works and highlights the differences between our method and the existing ones; Section III presents our method in detail; Section IV introduces the experimental results of our method and performance comparisons with existing methods; Section V further discusses the quality issues of HR images, error cases and limitations of the proposed method; Section VI concludes the paper while pinpointing some issues for future study.

II Related Work

In this section, we briefly review super-resolution techniques and some typical scene text recognizers. According to whether they exploit text-specific information from HR images, recent super-resolution methods can be roughly divided into two groups: generic super-resolution approaches and scene text image super-resolution approaches.

II-A Generic Image Super-Resolution

Generic image super-resolution methods [16, 17, 18, 19] usually recover LR images through pixel information from HR images captured by pixel loss functions. In particular, SRCNN [20] is a three-layer convolutional neural network. [21] and SRResNet [22] adopt generative adversarial networks to generate distinguishable images. [23] employs convolutional layers, transposed convolution and sub-pixel convolution layers to extract and upscale features. RCAN [24] and SAN [25] introduce attention mechanisms to boost the recovery. More recently, transformer-structured approaches [26, 27, 28] have been proposed to further advance the task of generic image super-resolution. Nevertheless, these approaches ignore text-specific properties of the scene text images, which leads to low recognition performance when applied to STISR.

II-B Scene Text Image Super-Resolution

Recent approaches focus on extracting various text-specific information from the HR images, which is then utilized to supervise model training. Specifically, [29, 30] calculate text-specific losses to boost performance. [31] proposes a multi-task framework that jointly optimizes recognition and super-resolution branches. [7] introduces TSRN and the gradient profile loss to capture sequential information of text images and gradient fields of HR images for sharpening the texts. PCAN [10] is proposed to learn sequence-dependent and high-frequency information for the reconstruction. STT [8] makes use of character-level information from HR images extracted by a pre-trained transformer recognizer to conduct text-focused super-resolution. [32] proposes a content perceptual loss that extracts multi-scale text recognition features to conduct content-aware supervision. TPGSR [12], TATT [13], and C3-STISR [14] extract text-specific clues to guide the super-resolution. In particular, TPGSR is the first method that additionally introduces a scene text recognizer to provide text priors; the extracted priors are then fed back to iteratively benefit the super-resolution. TATT [13] introduces a transformer-based module, which leverages a global attention mechanism, to exert the semantic guidance of the text prior on the text reconstruction process. C3-STISR [14] learns triple clues, i.e., a recognition clue from an STR model, a linguistic clue from a language model, and a visual clue from a skeleton painter, to enrich the representation of the text-specific clue. TG [9] and [11] exploit stroke-level information from HR images via a stroke-focused module and a skeleton loss for more fine-grained super-resolution. Compared with generic image super-resolution approaches, these methods greatly advance the recognition accuracy through various text-specific information extraction techniques. Nevertheless, they all assume that HR images are completely trustworthy, which is actually not true in practice. As a result, their extracted supervision information may be erroneous, which impacts the STISR performance. Since HiREN applies these methods to implement the LR recovery branch, to elaborate the differences among various super-resolution techniques in this paper, we give a summary of these methods in Tab. I on three major aspects: how their super-resolution blocks and loss functions are designed, and whether they use an iterative super-resolution technique to boost the performance.

TABLE I: Differences between typical STISR methods from three aspects: super-resolution block, loss function, and whether this method is iterative or not.
Method Super-resolution block Loss function $\mathcal{L}_{LR}$ Iterative
SRCNN [20] SRCNN [20] MSE ✗
SRResNet [22] SRResNet [22] MSE ✗
TSRN [7] SRB [7] Gradient profile loss [7] ✗
PCAN [10] PCA [10] Edge guidance loss [10] ✗
STT [8] TBSRN [8] Text-focused loss [8] ✗
TPGSR [12] SRB [7] Gradient profile loss [7] ✓
TG [9] SRB [7] Stroke-focused loss [9] ✗

II-C Scene Text Recognition

Scene text recognition (STR) [33, 1, 2, 34, 35] has made great progress in recent years. Specifically, CRNN [36] takes CNN and RNN as the encoder and employs a CTC-based [37] decoder to maximize the probabilities of paths that can reach the ground truth. ASTER [38] introduces a spatial transformer network (STN) [39] to rectify irregular text images. MORAN [40] proposes a multi-object rectification network. [41, 42, 43] propose novel attention mechanisms. AutoSTR [44] searches the backbone via neural architecture search (NAS) [45]. More recently, semantic-aware [46, 43], transformer-based [47], linguistics-aware [48, 49], and efficient [50, 51] approaches have been proposed to further boost the performance. Although these methods are able to handle irregular, occluded, and incomplete text images, they still have difficulty in recognizing low-resolution images. For example, as can be seen in Tab. II, CRNN, MORAN, and ASTER only achieve recognition accuracies of 27.3%, 41.1% and 47.2% respectively when directly taking LR images as input. What is more, finetuning these recognizers is insufficient for accurately recognizing texts from LR images, as reported in [7]. Therefore, a pre-processor is required to recover the details of low-resolution images.

II-D Difference between Our Method and Existing STISR Works

The motivation of HiREN is totally different from that of existing STISR approaches. As described above, existing methods focus on extracting text-specific information from HR images to supervise STISR. On the contrary, HiREN first lifts the quality of HR images, then uses the enhanced images to supervise STISR. This allows HiREN to work with most existing STISR approaches and boost their recognition performances in a general, economical and easy-to-use way.

III Method

Here, we first give an overview of our framework HiREN, then briefly introduce the LR recovery branch. Subsequently, we present the HR enhancement branch and the quality estimation module in detail, followed by the usage of HiREN.

Figure 2: The framework of HiREN. Red lines are valid only during training.

III-A Overview

Given a low-resolution (LR) image $I_{LR}\in\mathbb{R}^{C\times N}$, where $C$ is the number of channels of the image, $N=H\times W$ is the collapsed spatial dimension, and $H$ and $W$ are the height and width of $I_{LR}$, our aim is to produce a super-resolution (SR) image $I_{SR}\in\mathbb{R}^{C\times(4\times N)}$ with a magnification factor of $\times 2$. Fig. 2 shows the architecture of our framework HiREN, which is composed of two major branches and a quality estimation module: the LR recovery branch $f_{LR}$ that takes $I_{LR}$ as input to generate a super-resolution image $I_{SR}=f_{LR}(I_{LR})$ and a corresponding loss $\mathcal{L}_{o}$; the HR enhancement branch $f_{HR}$ that takes $I_{HR}$ as input to generate a high-quality (HQ) image $I_{HQ}=f_{HR}(I_{HR})$, where $I_{HQ}\in\mathbb{R}^{C\times(4\times N)}$; and the quality estimation module $f_{QE}$ that takes $I_{HQ}$ and $\mathcal{L}_{o}$ as input to compute a quality-aware loss $\mathcal{L}_{LR}$ to supervise the LR branch:

$\mathcal{L}_{LR}=f_{QE}(I_{HQ},\mathcal{L}_{o})$. (1)

During inference, $f_{HR}$ and $f_{QE}$ are removed. Thus, HiREN does not introduce extra inference cost.
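To make the training-time data flow concrete, the following is a minimal PyTorch-style sketch of one training step under this framework. The module names (lr_branch, hr_branch, quality_estimator) and the plain MSE used as a placeholder for $\mathcal{L}_{o}$ are illustrative assumptions, not the exact implementation.

```python
import torch

def training_step(lr_branch, hr_branch, quality_estimator, I_LR, I_HR, gt_texts):
    """One training step of the LR recovery branch under HiREN (sketch).

    lr_branch:         f_LR, any existing STISR model
    hr_branch:         f_HR, the already-trained HR enhancement branch
    quality_estimator: f_QE, assumed to return per-sample weights in [0, 1]
    """
    # The HR enhancement branch produces higher-quality supervision targets.
    with torch.no_grad():
        I_HQ = hr_branch(I_HR)

    # The LR recovery branch produces the super-resolved image.
    I_SR = lr_branch(I_LR)

    # Per-sample loss L_o of the underlying STISR method (a plain MSE here).
    loss_o = ((I_SR - I_HQ) ** 2).flatten(1).mean(dim=1)    # shape (B,)

    # Quality-aware re-weighting of the per-sample losses (Eq. (1)/(6)).
    weights = quality_estimator(I_HQ, gt_texts)             # shape (B,)
    return (loss_o * weights).mean()
```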

III-B LR Recovery Branch

In HiREN, the LR recovery branch can be any of the existing STISR approaches. As shown in Fig. 2, these methods usually work in the following way: 1) Start with a spatial transformer network (STN) [39], since in the TextZoom dataset [7] the HR-LR pairs are manually cropped and matched by humans, which may incur several pixel-level offsets. 2) Several super-resolution blocks are used to learn sequence-dependent information of text images. 3) A pixel shuffle module is employed to reshape the super-resolved image. 4) Various loss functions serve as $\mathcal{L}_{o}$ to extract text-specific information from the ground truth ($I_{HR}$ in existing works, $I_{HQ}$ in HiREN) to provide the supervision. To elaborate the differences among the various LR branches tested in this paper, we give a summary of these methods in Tab. I.
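For illustration, a structural sketch of this generic pipeline is given below. The rectification module and the super-resolution blocks are simplified placeholders (an identity and plain convolution blocks), since the concrete designs differ per method as summarized in Tab. I.

```python
import torch.nn as nn

class TypicalSTISRBackbone(nn.Module):
    """Skeleton of the common LR recovery pipeline (sketch, no specific method).

    The rectification module, SRB design and loss differ per method (Tab. I);
    only the overall ordering 1)-3) described above is illustrated here.
    """
    def __init__(self, channels=3, hidden=64, num_blocks=5, scale=2):
        super().__init__()
        self.rectifier = nn.Identity()        # stand-in for the STN of [39]
        self.head = nn.Conv2d(channels, hidden, 3, padding=1)
        self.blocks = nn.Sequential(*[        # stand-ins for SRB-like blocks
            nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
            for _ in range(num_blocks)
        ])
        self.upscale = nn.Sequential(         # pixel shuffle upscaling
            nn.Conv2d(hidden, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, I_LR):
        x = self.rectifier(I_LR)   # 1) alignment / rectification
        x = self.head(x)
        x = self.blocks(x)         # 2) sequence-aware super-resolution blocks
        return self.upscale(x)     # 3) reshape to the SR image
```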

As the motivation of HiREN is totally different from that of the existing methods, our method can work with most of them and significantly improve their performances.

III-C HR Enhancement Branch

III-C1 Overall introduction.

The enhancement of HR images is a challenging task, where the challenges lie in two aspects that will be detailed in the sequel. Formally, the HR image $I_{HR}$ and the corresponding HQ image $I_{HQ}$ we are pursuing are connected by a degradation model as follows:

$I_{HR}=k\otimes I_{HQ}+n$, (2)

where $\otimes$ denotes the convolution operation, $k$ is the degradation kernel, and $n$ is the additive noise that follows a Gaussian distribution in real-world applications [52, 53]. Different from the degradation from $I_{HR}$ to $I_{LR}$, where the kernel is determined by lens zooming, the degradation $k$ of $I_{HQ}$ is unfortunately unknown. As shown in Fig. 1(c), such degradation can be, but is not limited to, blurring (the 1st and 2nd cases) and low contrast (the 3rd case). What is more, we also lack pixel-level supervision information for $I_{HQ}$. These two challenges make existing STISR methods unable to enhance $I_{HR}$. To cope with the first challenge, we adopt blind image deblurring techniques [54, 55, 53, 52] to boost the recovery of $I_{HR}$. Specifically, as shown in Fig. 2, our HR enhancement branch consists of two components: a kernel predictor $P$ and a kernel-guided enhancement network $f_{ke}$. The kernel predictor estimates the degradation kernel $k$ (i.e., $k=P(I_{HR})$, where $k\in\mathbb{R}^{d}$ and $d$ is the size of the kernel), while the kernel-guided enhancement network takes the predicted kernel and $I_{HR}$ as input to conduct kernel-guided enhancement: $I_{HQ}=f_{ke}(I_{HR},k)$. The predicted kernel is utilized as a clue to strengthen the model's ability to handle various degradations and boost the recovery of HR images. As for the second challenge, we introduce a pre-trained scene text recognizer $R$ to provide the supervision for generating more recognizable HQ images. After training the HR enhancement branch $f_{HR}$, HiREN uses the trained $f_{HR}$ to generate HQ images, which are exploited for training the LR recovery branch.
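As an illustration of the degradation model in Eq. (2), the following sketch applies a given blur kernel and additive Gaussian noise to an HQ image batch; the isotropic Gaussian kernel and the noise level are assumptions, since the true degradation in TextZoom is unknown.

```python
import torch
import torch.nn.functional as F

def degrade(I_HQ, kernel, noise_std=0.01):
    """Apply the degradation model of Eq. (2): I_HR = k (*) I_HQ + n (sketch).

    I_HQ:   (B, C, H, W) image batch
    kernel: (kh, kw) blur kernel, e.g. an isotropic Gaussian (an assumption)
    """
    B, C, H, W = I_HQ.shape
    k = kernel.to(I_HQ).repeat(C, 1, 1, 1)                  # depthwise weights
    ph, pw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = F.pad(I_HQ, [pw, pw, ph, ph], mode="replicate")
    blurred = F.conv2d(padded, k, groups=C)                 # k convolved with I_HQ
    return blurred + noise_std * torch.randn_like(blurred)  # additive Gaussian n
```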

Figure 3: The structure of the HR enhancement branch, which consists of two components: (a) the kernel predictor $P$, and (b) the kernel-guided enhancement network $f_{ke}$.

III-C2 The kernel predictor.

As shown in Fig. 3, to generate a prediction of the degradation kernel, we first utilize convolution layers to obtain a spatial estimation of the kernel. Then, we employ global average pooling [56] to output the global prediction by evaluating the spatial mean value. Thus, we obtain a kernel prediction in $\mathbb{R}^{d}$ in a simple yet effective way.
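A possible PyTorch realization of this predictor is sketched below; the layer widths and depths are assumptions rather than the exact configuration.

```python
import torch.nn as nn

class KernelPredictor(nn.Module):
    """Sketch of the kernel predictor P: convolutions produce a spatial kernel
    estimate, and global average pooling turns it into a d-dimensional vector.
    Layer widths/depths are assumptions, not the paper's exact configuration."""
    def __init__(self, in_channels=3, hidden=64, kernel_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, kernel_dim, 3, padding=1),  # spatial kernel estimate
        )
        self.pool = nn.AdaptiveAvgPool2d(1)               # global average pooling

    def forward(self, I_HR):
        spatial = self.features(I_HR)                     # (B, d, H, W)
        return self.pool(spatial).flatten(1)              # (B, d) global prediction
```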

III-C3 The kernel-guided enhancement network.

As shown in Fig. 3, our kernel-guided enhancement network is designed in the following way: 1) Start with an input convolution to change the channel number from $C$ to $C^{\prime}$. 2) Repeat $N$ modified SRB blocks [7]. Each block consists of two convolution layers and one bi-directional GRU [57] (BGRU) to handle sequential text images. At this step, we first stretch the predicted kernel $k$ to pixel shape, then concatenate the pixel-wise kernel with the feature map extracted by the convolution layers along the channel dimension. 3) An output convolution is applied to obtain the final enhanced HQ image $I_{HQ}$.
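The following sketch shows one such kernel-guided block; the exact SRB internals follow [7] only loosely, and the residual connection and layer sizes are assumptions. $N$ such blocks, preceded by an input convolution ($C$ to $C^{\prime}$) and followed by an output convolution, would form the enhancement network described above.

```python
import torch
import torch.nn as nn

class KernelGuidedSRB(nn.Module):
    """One kernel-guided SRB-like block (sketch): the predicted kernel k is
    stretched to pixel shape, concatenated on the channel dimension, passed
    through two convolutions, then a bidirectional GRU runs over each row.
    Assumes an even channel count (e.g. C' = 32)."""
    def __init__(self, channels=32, kernel_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels + kernel_dim, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.bgru = nn.GRU(channels, channels // 2,
                           batch_first=True, bidirectional=True)

    def forward(self, feat, k):
        B, C, H, W = feat.shape
        k_map = k[:, :, None, None].expand(-1, -1, H, W)   # stretch k to pixels
        x = self.conv(torch.cat([feat, k_map], dim=1))
        seq = x.permute(0, 2, 3, 1).reshape(B * H, W, C)   # rows as sequences
        seq, _ = self.bgru(seq)
        out = seq.reshape(B, H, W, C).permute(0, 3, 1, 2)
        return feat + out                                  # residual connection
```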

III-C4 Loss functions.

Here, we design the loss functions of the HR enhancement branch $f_{HR}$. As shown in Fig. 2, there are two loss functions in $f_{HR}$. The first one is the recognition loss $\mathcal{L}_{rec}$, which is used to make the enhanced image $I_{HQ}$ more easily recognizable than $I_{HR}$. It is provided by a pre-trained recognizer $R$ and the text-level annotation of $I_{HR}$. Suppose the encoded text-level annotation is $p_{GT}\in\mathbb{R}^{L\times|\mathcal{A}|}$, where $L$ is the max prediction length of recognizer $R$ and $|\mathcal{A}|$ denotes the length of the alphabet $\mathcal{A}$. Then, the recognition loss can be evaluated by

$\mathcal{L}_{rec}=-\sum_{j=0}^{L}p_{GT}^{j}\log(R(I_{HQ})^{j})$, (3)

which is the cross entropy of $p_{GT}$ and $R(I_{HQ})$. Besides the recognition loss, it is essential to keep the style of the enhanced images, which has also been pointed out in a recent work [8]. Though HR images are not trustworthy, pixel information from HR images can help the model enhance the input images rather than totally regenerate them, which would be a much more challenging and uncontrollable task. In HiREN, we use the mean squared error (MSE) as the pixel loss to keep the style unchanged. Formally, we have

$\mathcal{L}_{sty}=\|I_{HQ}-I_{HR}\|_{2}$. (4)

With the recognition loss Eq. (3) and the style loss Eq. (4), the whole loss function of the HR enhancement branch can be written as follows:

$\mathcal{L}_{HR}=\alpha\mathcal{L}_{rec}+\mathcal{L}_{sty}$, (5)

where $\alpha$ is a hyper-parameter to trade off the two losses.
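A compact sketch of the combined loss in Eqs. (3)-(5) is given below, assuming the recognizer returns per-position log-probabilities and the annotations are one-hot encoded; this interface is an assumption for illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def hr_enhancement_loss(recognizer, I_HQ, I_HR, p_GT, alpha=0.1):
    """Sketch of Eq. (5): L_HR = alpha * L_rec + L_sty.

    recognizer: pre-trained STR model R, assumed to return per-position
                log-probabilities of shape (B, L, |A|)
    p_GT:       one-hot encoded text-level annotation, shape (B, L, |A|)
    """
    # Recognition loss, Eq. (3): cross entropy between p_GT and R(I_HQ).
    log_probs = recognizer(I_HQ)
    loss_rec = -(p_GT * log_probs).sum(dim=(1, 2)).mean()

    # Style loss, Eq. (4): MSE keeping the enhanced image close to the HR input.
    loss_sty = F.mse_loss(I_HQ, I_HR)

    return alpha * loss_rec + loss_sty
```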

III-D Quality Estimation Module

Though we can improve the quality of supervision information with the help of the HR enhancement branch, we cannot guarantee its correctness. Therefore, to suppress wrong supervision information, we design a quality estimation module $f_{QE}$ to evaluate the qualities of HQ images and weight their losses accordingly.

Let the original loss of the LR branch be $\mathcal{L}_{o}\in\mathbb{R}^{B}$, where $B$ denotes the batch size. We adopt the normalized Levenshtein similarity [15] between the $i$-th HQ image's recognition result $pred_{i}$ from a recognizer $R$ and the corresponding ground truth $gt_{i}$ to measure its quality, and then utilize the quality values of all HQ images to compute the final loss:

$\mathcal{L}_{LR}=\mathcal{L}_{o}\,[NS(pred_{1},gt_{1}),\ldots,NS(pred_{B},gt_{B})]^{\top}/B$, (6)

where $NS(\cdot,\cdot)$ denotes the normalized Levenshtein similarity, which has the following two advantages: 1) its value falls between 0 and 1; 2) it has a smooth response and thus can gracefully capture character-level errors [58]. These advantages make it suitable for weighting the losses of HQ images.
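The quality weighting can be sketched as follows, with a plain dynamic-programming edit distance; the exact normalization used in [15, 58] may differ slightly.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def quality_weights(preds, gts):
    """Normalized Levenshtein similarities NS(pred_i, gt_i) in [0, 1]."""
    weights = []
    for p, g in zip(preds, gts):
        denom = max(len(p), len(g), 1)
        weights.append(1.0 - levenshtein_distance(p, g) / denom)
    return weights   # used to weight the per-sample losses as in Eq. (6)
```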

Algorithm 1 The online usage of HiREN.
1:Input: Training dataset $\mathcal{D}$ and a pretrained recognizer $R$
2:Initialize $f_{HR}$ and $f_{LR}$
3:# Develop the HR enhancement branch
4:while $f_{HR}$ is not converged do
5:     $I_{HR}, p_{GT} \sim \mathcal{D}$
6:     $I_{HQ} = f_{HR}(I_{HR})$
7:     Compute $\mathcal{L}_{rec}$ via Eq. (3)
8:     Compute $\mathcal{L}_{sty}$ via Eq. (4)
9:     $\mathcal{L}_{HR} = \alpha\mathcal{L}_{rec} + \mathcal{L}_{sty}$
10:     Optimize $f_{HR}$ via $\mathcal{L}_{HR}$
11:while $f_{LR}$ is not converged do
12:     $I_{LR}, I_{HR} \sim \mathcal{D}$
13:     $I_{HQ} = f_{HR}(I_{HR})$
14:     $I_{SR} = f_{LR}(I_{LR})$
15:     Compute $\mathcal{L}_{o}$ according to $I_{SR}$ and $I_{HQ}$
16:     Optimize $f_{LR}$ with respect to $\mathcal{L}_{o}$
17:return $f_{LR}$
Algorithm 2 The offline usage of HiREN.
1:Input: Training dataset $\mathcal{D}$ and the developed HR enhancement branch $f_{HR}$
2:Initialize $f_{LR}$
3:$\hat{\mathcal{D}} = \emptyset$
4:for $I_{LR}, I_{HR} \sim \mathcal{D}$ do
5:     $I_{HQ} = f_{HR}(I_{HR})$
6:     Add $(I_{HQ}, I_{LR})$ to $\hat{\mathcal{D}}$
7:while $f_{LR}$ is not converged do
8:     $I_{HQ}, I_{LR} \sim \hat{\mathcal{D}}$
9:     $I_{SR} = f_{LR}(I_{LR})$
10:     Compute $\mathcal{L}_{o}$ according to $I_{SR}$ and $I_{HQ}$
11:     Optimize $f_{LR}$ with respect to $\mathcal{L}_{o}$
12:return $f_{LR}$

III-E The Usage of HiREN

In this section, we introduce the usage of HiREN. As mentioned above, there are two ways to deploy it. One way is called "online", which can be easily implemented by plugging the HR enhancement branch into the training procedure of the LR recovery branch. The online installation algorithm of HiREN is given in Alg. 1. As shown in Alg. 1, the first step is to develop the HR enhancement branch (i.e., L4~L10). Specifically, given a STISR dataset $\mathcal{D}$, we first sample HR images and their corresponding text-level annotations from $\mathcal{D}$ (L5), then generate the enhanced images $I_{HQ}$ (L6). Finally, the recognition loss and the style loss described in Sec. III-C4 are computed to optimize $f_{HR}$ (L7~L10). After that, we plug the developed HR enhancement branch into the training procedure of the LR recovery branch (L11~L16). In particular, after sampling LR and HR images from the dataset $\mathcal{D}$ (L12), we use the HR enhancement branch to generate the HQ image $I_{HQ}$ (L13). Finally, the HQ image, rather than the HR image used in typical works, and the SR image are utilized to compute the text-specific loss $\mathcal{L}_{o}$ to supervise the LR recovery branch (L15~L16).

The other way is called "offline", which can be implemented by caching all the enhanced HQ images. As shown in Alg. 2, after developing the HR enhancement branch $f_{HR}$, we sample all the LR-HR image pairs from the old dataset $\mathcal{D}$, generate the corresponding HQ images, and add them to the new dataset $\hat{\mathcal{D}}$ (L6). When training the LR recovery branch, all we need to do is sample LR-HQ image pairs to compute the loss $\mathcal{L}_{o}$ for optimizing the model. Such an installation does not introduce any additional training cost to the LR recovery branch. It is worth mentioning that the HR enhancement branch is removed during inference. That is, HiREN does not introduce any additional inference cost.
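A minimal sketch of the offline caching step in Alg. 2 is given below; the data loader yielding (LR, HR) pairs and the tensor-based cache are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset

@torch.no_grad()
def build_offline_dataset(f_HR, pair_loader, device="cuda"):
    """Offline deployment (Alg. 2, sketch): enhance every HR image once and
    cache the resulting LR-HQ pairs for later training of the LR branch."""
    f_HR.eval()
    lr_images, hq_images = [], []
    for I_LR, I_HR in pair_loader:              # assumed (LR, HR) pair loader
        hq_images.append(f_HR(I_HR.to(device)).cpu())
        lr_images.append(I_LR)
    return TensorDataset(torch.cat(lr_images), torch.cat(hq_images))
```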

IV Performance Evaluation

In this section, we first introduce the dataset and metrics used in the experiments and the implementation details. Then, we evaluate HiREN and compare it with several state-of-the-art techniques to show its effectiveness and superiority. Finally, we conduct extensive ablation studies to validate the design of our method.

IV-A Dataset and Metrics

Two groups of datasets are evaluated in this paper: low-resolution scene text dataset TextZoom and regular scene text recognition datasets.

IV-A1 Low-resolution scene text dataset

The TextZoom [7] dataset consists of 21,740 LR-HR text image pairs collected by lens zooming of the camera in real-world scenarios. The training set has 17,367 pairs, while the test set is divided into three settings based on the camera focal length: easy (1,619 samples), medium (1,411 samples), and hard (1,343 samples).

IV-A2 Regular STR datasets

These datasets are used to check the generalization power of our model trained on TextZoom when being adapted to other datasets. In particular, three regular STR datasets are evaluated in our paper to further check the advantage of HiREN: IC15-352 [8], SVT [59], and SVTP [60]. In what follows, we give brief introductions to these datasets.

The IC15-352 dataset was first constructed in [8]. It consists of 352 low-resolution images collected from the IC15 [61] dataset.

Street View Text (SVT) [59] is collected from Google Street View. The test set contains 647 images. Many images in SVT severely suffer from noise, blur, and low resolution.

SVT-Perspective (SVTP[60] is proposed for evaluating the performance of reading perspective texts. Images in SVTP are picked from the side-view images in Google Street View. Many of them are heavily distorted by the non-frontal view angle. This dataset contains 639 images for evaluation.

The major metric used in this paper is word-level recognition accuracy, which evaluates the recognition performance of STISR methods. Following the settings of previous works [9], we remove punctuation and convert uppercase letters to lowercase when calculating recognition accuracy. Besides, the number of floating-point operations (FLOPs) is used to evaluate the computational cost of various methods. Following [9, 32], we only report Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [62] as auxiliary metrics to evaluate the fidelity performance, because of the quality issue of the HR images.
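For clarity, the accuracy protocol can be sketched as follows; the normalization shown (lowercasing and stripping ASCII punctuation) is our reading of the protocol in [9].

```python
import string

def normalize_text(s: str) -> str:
    """Lowercase and remove punctuation, following the evaluation protocol."""
    return "".join(ch for ch in s.lower() if ch not in string.punctuation)

def word_accuracy(preds, gts) -> float:
    """Word-level recognition accuracy over paired predictions and labels."""
    correct = sum(normalize_text(p) == normalize_text(g)
                  for p, g in zip(preds, gts))
    return correct / max(len(gts), 1)
```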

IV-B Implementation Details

All experiments are conducted on 2 NVIDIA Tesla V100 GPUs with 32GB memory. The PyTorch version is 1.8. The HR enhancement branch is trained using the Adam [63] optimizer with a learning rate of 0.0001. The batch size $B$ is set to 48. The LR recovery branch is trained with the same optimizer and batch size but a higher learning rate of 0.001, as suggested in [12]. The recognizer $R$ used in our method is the one proposed in [8]. The hyper-parameters in HiREN are set as follows: $\alpha$ is set to 0.1, which is determined through grid search; the number of SRB blocks is set to 5 (i.e., $N=5$) and $C^{\prime}$ is set to 32, the same as in [7]; the size of the kernel $k$ is set to 32 (i.e., $d=32$), similar to that suggested in [52]. Our training and evaluation follow this protocol: save the model with the best average accuracy during training with CRNN as the recognizer, and use this model to evaluate the other recognizers (MORAN, ASTER) and the three settings (easy, medium, hard).
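For reference, the reported training settings can be gathered into a small setup sketch; f_HR and f_LR below stand for the two branches and are assumed module instances.

```python
import torch

def build_optimizers(f_HR, f_LR):
    """Optimizer setup following the reported settings (sketch); f_HR and f_LR
    are assumed to be the HR enhancement and LR recovery branch modules."""
    hr_opt = torch.optim.Adam(f_HR.parameters(), lr=1e-4)  # HR enhancement branch
    lr_opt = torch.optim.Adam(f_LR.parameters(), lr=1e-3)  # LR recovery branch
    return hr_opt, lr_opt

# Other reported settings: batch size B = 48, alpha = 0.1, N = 5 SRB blocks,
# C' = 32 channels, kernel size d = 32.
```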

IV-C Performance Improvement on SOTA Approaches

TABLE II: Performance (recognition accuracy) improvement on TextZoom.
Method CRNN [36] MORAN [40] ASTER [38]
Easy Medium Hard Average Easy Medium Hard Average Easy Medium Hard Average
LR 37.5% 21.4% 21.1% 27.3% 56.2% 35.9% 28.2% 41.1% 64.6% 42.0% 31.7% 47.2%
+HiREN 37.7% 27.9% 23.5% 30.2% 57.9% 38.2% 28.7% 42.6% 66.4% 43.4% 32.3% 48.5%
HR 76.4% 75.1% 64.6% 72.4% 89.0% 83.1% 71.1% 81.6% 93.4% 87.0% 75.7% 85.9%
+HiREN 77.5% 75.4% 65.0% 72.9% 88.8% 83.7% 71.9% 82.0% 93.5% 87.5% 76.2% 86.3%
SRCNN 39.8% 23.4% 21.7% 29.0% 57.7% 36.1% 28.5% 41.8% 65.5% 41.9% 31.7% 47.5%
+HiREN 41.6% 24.0% 23.7% 30.4% 61.1% 38.6% 29.3% 44.0% 67.5% 44.7% 32.8% 49.5%
TSRN 52.8% 39.8% 31.6% 42.1% 64.5% 49.3% 36.7% 51.1% 69.7% 54.8% 41.3% 56.2%
+HiREN 56.5% 44.1% 32.2% 45.0% 68.5% 52.5% 38.6% 54.2% 73.5% 56.3% 39.2% 57.4%
TG 60.5% 49.0% 37.1% 49.6% 72.0% 57.6% 40.0% 57.6% 76.0% 61.4% 42.9% 61.2%
+HiREN 62.4% 51.2% 37.5% 51.1% 73.4% 58.4% 41.0% 58.6% 77.5% 61.5% 43.0% 61.7%
TPGSR 63.1% 52.0% 38.6% 51.8% 74.9% 60.5% 44.1% 60.5% 78.9% 62.7% 44.5% 62.8%
+HiREN 63.5% 52.7% 38.8% 52.4% 74.7% 60.9% 44.1% 60.5% 78.3% 63.5% 45.6% 63.5%

IV-C1 Recognition performance improvement

Here, we evaluate our method on TextZoom. Since HiREN is a framework that can work with most existing methods, we plug HiREN into the training of several typical super-resolution methods to check its universality and effectiveness, including one generic method SRCNN [20], two recently proposed STISR methods TSRN [7] and TG [9], and one iterative, clue-guided STISR method TPGSR [12]. To show that HiREN can support various recognizers, we follow previous works [12, 8, 9] and evaluate the recognition accuracy with three recognizers: CRNN [36], MORAN [40] and ASTER [38]. We re-implement these methods to unify hardware, software, and evaluation protocols for fair comparison. Generally, our results are higher than those in the original papers. For example, with CRNN, the average accuracy of TG is boosted from 48.9% to 49.6%. All the results are presented in Tab. II.

We first check the universality of HiREN. As can be seen in Tab. II, HiREN significantly boosts the recognition performance in almost all the cases, except for one case on TPGSR, which means that HiREN can work well with various existing techniques. As for the performance improvement of HiREN, take a non-iterative method as an example: the state-of-the-art TG [9] achieves 49.6%, 57.6% and 61.2% average accuracy respectively with the three recognizers (see the 9th row). After being equipped with HiREN, the accuracy is lifted to 51.1%, 58.6% and 61.7% (increasing by 1.5%, 1.0%, and 0.5%) respectively (see the 10th row). This demonstrates the effectiveness of our method. Results on more datasets and recognizers are given in the supplementary materials to further demonstrate its universality.

It is worth mentioning that our HR enhancement branch can also be applied to weakly supervising the enhancement of LR and HR images to lift their recognition accuracies, as shown in the 3rd and 5th rows of Tab. II. This further supports the universality of our technique. The above results show the promising application potential of our method: it not only works with STISR methods, but also pioneers weakly supervised enhancement of LR and HR text images.

TABLE III: Performance comparison on three STR datasets with CRNN as recognizer.
Method IC15-352 SVT SVTP
LR 49.4% 74.8% 60.8%
TSRN 48.9% 72.6% 61.4%
+HiREN 52.3% 74.8% 60.3%
TG 59.1% 74.2% 60.2%
+HiREN 61.7% 76.5% 60.5%
TPGSR 66.2% 77.4% 62.8%
+HiREN 66.8% 78.7% 63.6%

Furthermore, to better demonstrate the universality of HiREN, we conduct more experiments on additional STR datasets and recognizers. We first evaluate our method on three STR datasets, including IC15-352, SVT, and SVTP. We use the STISR models (TSRN, TG, TPGSR, and our technique applied to them) developed on the TextZoom dataset for the evaluation. The experimental results on IC15-352, SVT, and SVTP are given in Tab. III. As shown in Tab. III, HiREN also works well on these datasets and achieves improved performance in almost all the cases. In particular, the performances of TPGSR on the three datasets are lifted from 66.2%, 77.4%, and 62.8% to 66.8%, 78.7%, and 63.6%, respectively, which demonstrates the advantage of HiREN.

TABLE IV: Performance of recent recognizers on TextZoom.
Method SEED [46] ABINet [48]
LR 45.8% 61.0%
HR 84.8% 89.8%
TSRN 56.3% 64.0%
+HiREN 56.5% 63.8%
TG 60.7% 66.0%
+HiREN 60.9% 65.9%
TPGSR 61.7% 67.5%
+HiREN 62.2% 68.1%

Apart from that, we also give experimental results on more recently proposed recognizers, including SEED [46] and ABINet [48]. The experimental results are given in Tab. IV. As can be seen in Tab. IV, these recent recognizers still have difficulty in recognizing low-resolution text images. For example, SEED and ABINet can only correctly read 45.8% and 61.0% of LR images respectively, which is far below their performance on HR images (i.e., 84.8% and 89.8%). HiREN also achieves boosted performance with these recognizers in almost all the cases.

TABLE V: Fidelity and recognition results on major existing methods. The results are obtained by averaging three settings (easy, medium and hard).
Method Metrics
SR-HR SR-HQ Avg
PSNR SSIM ($\times 10^{-2}$) PSNR SSIM ($\times 10^{-2}$) Acc
LR 20.35 69.61 20.73 68.76 27.3%
TSRN 21.84 76.34 21.08 74.76 42.1%
+HiREN 22.01 76.60 21.46 76.23 45.0%
TG 21.47 73.57 20.89 72.59 49.6%
+HiREN 21.12 73.43 20.84 73.78 51.1%
TPGSR 22.05 76.71 21.05 76.77 51.8%
+HiREN 21.69 75.97 21.15 76.44 52.4%
Figure 4: Examples of generated images. Here, GT indicates ground truth. We use CRNN as the recognizer. Red/black characters indicate incorrectly/correctly recognized.

IV-C2 Fidelity improvement

We also report the results of fidelity improvement (PSNR and SSIM) on major existing methods in Tab. V. Notice that these fidelity metrics have the following limitations. On the one hand, PSNR and SSIM globally measure the similarity between the SR image and the ground-truth image, including both characters and background. Since the goal is to lift the recognition ability and readability of scene text images, STISR should put more emphasis on recovering characters rather than the background [9, 32]. On the other hand, as pointed out in this paper, HR images suffer from various quality issues. Hence, it is inappropriate to measure pixel similarity against erroneous HR images whose pixels are not trustworthy. Therefore, we only present PSNR and SSIM as auxiliary metrics to roughly draw some conclusions.

Notice that existing methods utilize SR-HR image pairs to calculate PSNR and SSIM. However, as mentioned above, the HR images suffer from quality issues. Hence, we additionally provide the fidelity results of calculating PSNR and SSIM between SR and HQ images. The experimental results are given in Tab. V. As can be seen in Tab. V: 1) A higher PSNR does not mean a higher recognition accuracy. For example, the PSNR of TG on SR-HR is inferior to that of TSRN (i.e., 21.47 vs. 21.84), but TG performs better in recognition accuracy (i.e., 49.6% vs. 42.1%). The reason lies in that TG is a stroke-focused technique, focusing on recovering fine-grained stroke details rather than the whole image quality including the background, which is minor to recognition. This is consistent with the results in [9]. 2) Compared with the original models, after applying HiREN, the SR-HQ fidelity performance of the new models is boosted in almost all cases. 3) HiREN gets lower performance on the PSNR and SSIM of SR-HR images but obtains improved recognition performance, which supports the quality issue of HR images.

IV-C3 Visualization

Here, we visualize several examples in Fig. 4 to better demonstrate the performance of our technique. We can see that HiREN helps the existing methods recover the blurry pixels better (see the 2nd to 6th cases). In particular, a better “ee” in the 2nd and 3rd cases, ‘m’ in the 4th case, ‘f’ in the 5th case, and ‘e’ in the 6th case are obtained by our technique. Besides, in some extremely tough cases where even with the HR images the recognition is hard, HiREN can still achieve better recovery (see the 7th case). These results show the power of HiREN.

TABLE VI: The training and inference costs of our method. The cost is measured in FLOPs (G).
Method Metrics
Training cost Inference cost
TG 19.60 0.91
+HiREN(Online) 20.59 0.91
+HiREN(Offline) 19.60 0.91
TPGSR 7.20 7.20
+HiREN(Online) 8.19 7.20
+HiREN(Offline) 7.20 7.20

IV-C4 Training and inference cost

We have discussed the high performance of our technique above. In this section, we provide the training and inference costs to show the efficiency of HiREN. Specifically, we take TG and TPGSR as baselines, add HiREN to them, and count their FLOPs during training and inference. The experimental results are presented in Tab. VI. In terms of training cost, we can see that the offline deployment of HiREN does not incur any additional cost. As for the online version, the additional computational cost caused by HiREN is negligible (e.g., from 19.60G to 20.59G, only 0.99G). What is more, neither of the two variants introduces any additional inference cost. In conclusion, the offline deployment not only saves training and inference cost, but also significantly boosts the performance. These results validate the efficiency of our method.

IV-D Ablation Study

We conduct extensive ablation studies to validate the design of our method. Since our method is designed to enhance HR images during training, the metric used in this section is the average recognition accuracy of CRNN on the training set, denoted as $Acc_{train}$.

TABLE VII: The ablation studies of the HR enhancement branch. Here, ✗ means the corresponding module is not applied, and Charb denotes the Charbonnier loss [64].
ID Kernel-guided $\mathcal{L}_{rec}$ $\mathcal{L}_{sty}$ $Acc_{train}$
1 ✗ ✗ ✗ 66.9
2 ✗ ✓ MSE 72.7
3 ✓ ✓ ✗ 66.1
4 ✓ ✗ MSE 67.4
5 ✓ ✓ Charb 67.5
6 ✓ ✓ L1 67.3
7 ✓ ✓ MSE 74.1

IV-D1 Design of the HR enhancement branch

Here, we check the design of the HR enhancement branch. As mentioned above, two techniques are developed to promote the enhancement of HR images: the kernel-guided enhancement network $f_{ke}$ and the loss $\mathcal{L}_{HR}$. We conduct experiments to check their effects. The experimental results are presented in Tab. VII. A visualization of the effect of the HR enhancement branch is given in the supplementary materials.

The effect of the HR enhancement branch. Comparing the results in the 1st and 7th rows of Tab. VII, we can see that the HR enhancement branch lifts the accuracy from 66.9% to 74.1%, which proves the effect of the branch as a whole.

The effect of kernel-guided enhancement network. To check the power of the kernel-guided enhancement network, we design a variant that removes the kernel predictor. Comparing the results of the 2nd and 7th rows in Tab. VII, we can see that the variant without the kernel predictor is inferior to that with the kernel predictor (72.7% v.s. 74.1%). This demonstrates the effectiveness of the proposed kernel-guided enhancement network.

The design of the loss function. Here, we check the design of the loss function used in the HR enhancement branch. We first remove the recognition loss $\mathcal{L}_{rec}$ and the style loss $\mathcal{L}_{sty}$ separately. As can be seen in the 3rd, 4th, and 7th rows of Tab. VII, compared with the combined loss, the performance of using only a single loss is degraded. Next, we check the selection of the style loss. Specifically, we consider three candidates (MSE, Charbonnier and L1) for the style loss function. As can be seen in the 5th, 6th, and 7th rows of Tab. VII, the MSE loss outperforms the Charbonnier loss [64] and the L1 loss. The reason lies in that MSE penalizes large errors and is more tolerant of small errors, which is more suitable for HiREN to enhance the blurry or missing character details while keeping the style unchanged [65]. Hence, MSE is selected as the style loss in HiREN.

TABLE VIII: The determination of $\alpha$. The metric is $Acc_{train}$.
$\alpha$ 0.5 0.2 0.1 0.05 0.025 0.01 0.005
$Acc_{train}$ 73.6 73.4 74.1 74.1 72.3 72.2 71.2

IV-D2 Hyper-parameter study

Here, we provide the grid search results for the hyper-parameter $\alpha$ introduced in HiREN to balance the two losses. The results are presented in Tab. VIII. As can be seen in Tab. VIII, the best performance is achieved when $\alpha$ is 0.1 or 0.05.

TABLE IX: Ablation study on the quality estimation module. The metric is the recognition accuracy of CRNN on the test set of TextZoom.
Method SRCNN TSRN TG TPGSR
without $f_{QE}$ 30.2% 44.2% 51.0% 51.9%
with $f_{QE}$ 30.4% 45.0% 51.1% 52.4%

IV-D3 The effect of the quality estimation module

Here, we compare the performances of different models with and without the quality estimation module. As can be seen in Tab. IX, without $f_{QE}$, all methods are degraded, which demonstrates the effect of the quality estimation module.

V Discussion

In this section, we discuss some issues to better demonstrate the advantages of HiREN and point out some limitations of the proposed method.

Figure 5: Some examples of low-quality HR images, their enhanced results (HQ) by our method, and their weight calculated by the quality estimation module, as well as the recognized results. For each case, the 1st row shows HR and HQ images, the 2nd row presents the normalized HR and HQ images to highlight their visual differences, and the 3rd row gives the recognized characters: red indicates incorrectly recognized, and black means correctly recognized.

V-A Which kind of quality issues do HR images have?

We conduct a visualization study to demonstrate the quality issues of HR images. As can be seen in Fig. 5, HR images suffer from various degradations, including but not limited to low contrast (the 1st, 2nd and 6th cases), blur (the 3rd and 4th cases) and motion blur (the 5th case). These unknown degradations obviously threaten the recognition of HR images and subsequently provide erroneous supervision for the recovery of the LR images.

V-B How does HiREN lift the quality of supervision information?

To cope with the various quality problems of HR images, HiREN generates HQ images through different strategies. In particular, HiREN makes the texts more prominent to address low contrast (e.g. the 1st and 2nd cases in Fig. 5). With respect to blur, HiREN makes the incorrectly recognized characters more distinguishable (e.g. "e" in the 3rd case and "ri" in the 4th case in Fig. 5). HiREN also tries to reduce the motion blur in the 5th case of Fig. 5. Although in some tough cases HiREN fails to generate a correct HQ image (e.g. the 6th case in Fig. 5), our quality estimation module weights its loss with a small value to suppress the erroneous supervision information.

Figure 6: Error analysis of HiREN. Here, GT indicates ground truth. We use CRNN as the recognizer. Red/black characters indicate incorrectly/correctly recognized.

V-C Error Analysis

In this section, we perform an error analysis of HiREN to provide possible research directions for further works. Concretely, we provide some error cases in Fig. 6 to illustrate the limitations of recent works and HiREN. As can be seen in the 1st and 2nd cases, recent methods usually rely on a vocabulary [66], which makes the models guess the blurry pixels from the corpus learned from the training dataset. This degrades the models' ability to recover numbers and punctuation. As a result, although HiREN recovers more characters than the original TPGSR, the word-level recovery still fails. Besides, as shown in the 3rd case, in some tough cases where the LR and HR images are extremely difficult to read, TPGSR and HiREN also fail to effectively do the recovery. This indicates the challenge of STISR.

V-D Limitations of HiREN

On the one hand, HiREN may introduce some noise to the HR images and worsen their quality. However, such noise is very minor compared to the benefit brought by HiREN. Specifically, we find that 9,565 erroneously recognized images in the TextZoom dataset are successfully enhanced by HiREN, leading to correct recognition results, while only 128 images are deteriorated from correct to wrong. On the other hand, the training of the HR enhancement branch requires the feedback of a scene text recognizer and text-level annotations. That is, HiREN still needs some weak supervision information for training.

VI Conclusion

In this paper, we present a novel framework called HiREN to boost STISR performance. Different from existing works, HiREN aims at generating high-quality text images based on high-resolution images to provide more accurate supervision information for STISR. Concretely, recognizing the difficulty in capturing the degradation from HQ to HR and in obtaining pixel-level supervision for HQ images, we explore degradation-kernel-guided super-resolution and exploit the feedback of a recognizer as well as text-level annotations as weak supervision to train an HR enhancement branch. What is more, to suppress erroneous supervision information, a novel quality estimation module is designed to evaluate the qualities of images, which are used to weight their losses. Extensive experiments demonstrate the universality, high performance and efficiency of HiREN. Our work provides a new solution for the STISR task.

In the future, we will try to explore more advanced models to further advance the proposed technique. On the one hand, we will try to further improve the recovery ability of the HR enhancement branch or address the vocabulary reliance issue. On the other hand, we plan to apply HiREN to self-supervised or unsupervised settings where the recognizer and text-level annotations are not trustworthy or text-level annotations are lacking during training. Last but not least, we will extend the idea of the proposed quality enhancement branch to build a new noisy-learning algorithm for STISR.

References

  • [1] J. Chen, H. Yu, J. Ma, M. Guan, X. Xu, X. Wang, S. Qu, B. Li, and X. Xue, “Benchmarking chinese text recognition: Datasets, baselines, and an empirical study,” arXiv preprint arXiv:2112.15093, 2021.
  • [2] X. Chen, L. Jin, Y. Zhu, C. Luo, and T. Wang, “Text recognition in the wild: A survey,” ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–35, 2021.
  • [3] C. Zhang, W. Ding, G. Peng, F. Fu, and W. Wang, “Street view text recognition with deep learning for urban scene understanding in intelligent transportation systems,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [4] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8317–8326.
  • [5] M. Mathew, D. Karatzas, and C. Jawahar, “Docvqa: A dataset for vqa on document images,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2200–2209.
  • [6] M. Zhao, B. Li, J. Wang, W. Li, W. Zhou, L. Zhang, S. Xuyang, Z. Yu, X. Yu, G. Li et al., “Towards video text visual question answering: Benchmark and baseline,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • [7] W. Wang, E. Xie, X. Liu, W. Wang, D. Liang, C. Shen, and X. Bai, “Scene text image super-resolution in the wild,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16.   Springer, 2020, pp. 650–666.
  • [8] J. Chen, B. Li, and X. Xue, “Scene text telescope: Text-focused scene image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 026–12 035.
  • [9] J. Chen, H. Yu, J. Ma, B. Li, and X. Xue, “Text gestalt: Stroke-aware scene text image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 285–293.
  • [10] C. Zhao, S. Feng, B. N. Zhao, Z. Ding, J. Wu, F. Shen, and H. T. Shen, “Scene text image super-resolution via parallelly contextual attention network,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2908–2917.
  • [11] S. Nakaune, S. Iizuka, and K. Fukui, “Skeleton-aware text image super-resolution,” in Proceedings of the 32nd British Machine Vision Conference, Online, 2021, pp. 22–25.
  • [12] J. Ma, S. Guo, and L. Zhang, “Text prior guided scene text image super-resolution,” IEEE Transactions on Image Processing, 2023.
  • [13] J. Ma, Z. Liang, and L. Zhang, “A text attention network for spatial deformation robust scene text image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5911–5920.
  • [14] M. Zhao, M. Wang, F. Bai, B. Li, J. Wang, and S. Zhou, “C3-stisr: Scene text image super-resolution with triple clues,” in International Joint Conference on Artificial Intelligence, 2022, pp. 1707–1713.
  • [15] V. I. Levenshtein et al., “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8.   Soviet Union, 1966, pp. 707–710.
  • [16] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.
  • [17] C. Tian, Y. Xu, W. Zuo, B. Zhang, L. Fei, and C.-W. Lin, “Coarse-to-fine cnn for image super-resolution,” IEEE Transactions on Multimedia, vol. 23, pp. 1489–1502, 2020.
  • [18] H. Li, J. Qin, Z. Yang, P. Wei, J. Pan, L. Lin, and Y. Shi, “Real-world image super-resolution by exclusionary dual-learning,” IEEE Transactions on Multimedia, 2022.
  • [19] Q. Jiang, Z. Liu, K. Gu, F. Shao, X. Zhang, H. Liu, and W. Lin, “Single image super-resolution quality assessment: a real-world dataset, subjective studies, and an objective metric,” IEEE Transactions on Image Processing, vol. 31, pp. 2279–2294, 2022.
  • [20] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2015.
  • [21] X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister, and M.-H. Yang, “Learning to super-resolve blurry face and text images,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 251–260.
  • [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
  • [23] R. K. Pandey, K. Vignesh, A. Ramakrishnan et al., “Binary document image super resolution for improved readability and ocr performance,” arXiv preprint arXiv:1812.02475, 2018.
  • [24] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 286–301.
  • [25] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 065–11 074.
  • [26] W. Li, X. Lu, J. Lu, X. Zhang, and J. Jia, “On efficient transformer and image pre-training for low-level vision,” arXiv preprint arXiv:2112.10175, 2021.
  • [27] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1833–1844.
  • [28] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 683–17 693.
  • [29] C. Fang, Y. Zhu, L. Liao, and X. Ling, “Tsrgan: Real-world text image super-resolution based on adversarial learning and triplet attention,” Neurocomputing, vol. 455, pp. 88–96, 2021.
  • [30] W. Wang, E. Xie, P. Sun, W. Wang, L. Tian, C. Shen, and P. Luo, “Textsr: Content-aware text super-resolution guided by recognition,” arXiv preprint arXiv:1909.07113, 2019.
  • [31] Y. Mou, L. Tan, H. Yang, J. Chen, L. Liu, R. Yan, and Y. Huang, “Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16.   Springer, 2020, pp. 158–174.
  • [32] R. Qin, B. Wang, and Y.-W. Tai, “Scene text image super-resolution via content perceptual loss and criss-cross transformer blocks,” arXiv preprint arXiv:2210.06924, 2022.
  • [33] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, “Edit probability for scene text recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2018, pp. 1508–1516.
  • [34] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou, “Focusing attention: Towards accurate text recognition in natural images,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5076–5084.
  • [35] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, “Aon: Towards arbitrarily-oriented text recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2018, pp. 5571–5579.
  • [36] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 11, pp. 2298–2304, 2016.
  • [37] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
  • [38] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 9, pp. 2035–2048, 2018.
  • [39] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [40] C. Luo, L. Jin, and Z. Sun, “Moran: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019.
  • [41] W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin, “Gtc: Guided training of ctc towards efficient and accurate scene text recognition,” in AAAI, vol. 34, no. 07, 2020, pp. 11 005–11 012.
  • [42] F. Sheng, Z. Chen, and B. Xu, “Nrtr: A no-recurrence sequence-to-sequence model for scene text recognition,” in 2019 International conference on document analysis and recognition (ICDAR).   IEEE, 2019, pp. 781–786.
  • [43] D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, and E. Ding, “Towards accurate scene text recognition with semantic reasoning networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 113–12 122.
  • [44] H. Zhang, Q. Yao, M. Yang, Y. Xu, and X. Bai, “Autostr: efficient backbone search for scene text recognition,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16.   Springer, 2020, pp. 751–767.
  • [45] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 1997–2017, 2019.
  • [46] Z. Qiao, Y. Zhou, D. Yang, Y. Zhou, and W. Wang, “Seed: Semantics enhanced encoder-decoder framework for scene text recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 528–13 537.
  • [47] R. Atienza, “Vision transformer for fast and efficient scene text recognition,” in International Conference on Document Analysis and Recognition.   Springer, 2021, pp. 319–334.
  • [48] S. Fang, H. Xie, Y. Wang, Z. Mao, and Y. Zhang, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7098–7107.
  • [49] Y. Wang, H. Xie, S. Fang, J. Wang, S. Zhu, and Y. Zhang, “From two to one: A new scene text recognizer with visual language modeling network,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 14 194–14 203.
  • [50] Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, and Y.-G. Jiang, “Svtr: Scene text recognition with a single visual model,” arXiv preprint arXiv:2205.00159, 2022.
  • [51] D. Bautista and R. Atienza, “Scene text recognition with permuted autoregressive sequence models,” in Proceedings of the 17th European Conference on Computer Vision (ECCV).   Cham: Springer International Publishing, 10 2022.
  • [52] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with iterative kernel correction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1604–1613.
  • [53] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2018, pp. 3262–3271.
  • [54] T. Michaeli and M. Irani, “Nonparametric blind super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 945–952.
  • [55] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 701–710.
  • [56] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  • [57] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [58] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, and D. Karatzas, “Scene text visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4291–4301.
  • [59] K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” in Proceedings of the IEEE International Conference on Computer Vision.   IEEE, 2011, pp. 1457–1464.
  • [60] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, “Recognizing text with perspective distortion in natural scenes,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 569–576.
  • [61] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., “Icdar 2015 competition on robust reading,” in 2015 13th international conference on document analysis and recognition (ICDAR).   IEEE, 2015, pp. 1156–1160.
  • [62] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [63] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [64] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud, “Two deterministic half-quadratic regularization algorithms for computed imaging,” in Proceedings of 1st International Conference on Image Processing, vol. 2.   IEEE, 1994, pp. 168–172.
  • [65] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on computational imaging, vol. 3, no. 1, pp. 47–57, 2016.
  • [66] Z. Wan, J. Zhang, L. Zhang, J. Luo, and C. Yao, “On vocabulary reliance in scene text recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 425–11 434.