TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution

Baolin Liu\equalcontrib¹², Zongyuan Yang\equalcontrib¹², Pengfei Wang¹, Junjie Zhou¹, Ziqi Liu¹, Ziyi Song¹, Yan Liu¹, Yongping Xiong¹ Corresponding Author

Abstract

The goal of scene text image super-resolution is to reconstruct high-resolution text-line images from unrecognizable low-resolution inputs. The existing methods relying on the optimization of pixel-level loss tend to yield text edges that exhibit a notable degree of blurring, thereby exerting a substantial impact on both the readability and recognizability of the text. To address these issues, we propose TextDiff, the first diffusion-based framework tailored for scene text image super-resolution. It contains two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image and a mask that encodes the spatial location of the text. The MRD is responsible for effectively sharpening the text edge by modeling the residuals between the ground-truth images and the initial deblurred images. Extensive experiments demonstrate that our TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets and can improve the readability of scene text images. Moreover, our proposed MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods. This enhancement not only improves the readability and recognizability of the results generated by SOTA methods but also does not require any additional joint training. Available Codes:https://github.com/Lenubolim/TextDiff.

Introduction

Unlike optical character recognition (OCR), scene text image recognition poses persistent challenges owing to factors such as distortion, blurring, and other imaging problems (Long, He, and Yao 2021). Therefore, it is necessary to improve the quality of scene text images.

In the past few years, numerous natural image super-resolution methods (Wang et al. 2018; Ma, Gong, and Yu 2022; Whang et al. 2022) have been proposed, yet the performance on scene text images remains unsatisfactory. The images produced by these methods exhibit distorted text edges and severe artifacts, which may lead to recognition errors or even render certain text unrecognizable. The key challenge lies in the fact that low-resolution (LR) text images lack crucial textual details, making it arduous to accurately map them to high-resolution (HR) images with significant variations in content, font, and size, solely relying on low-level features. Hence, it becomes crucial to effectively restore text-level information, especially intricate details along the text edges of scene text images.

To achieve this goal, there are many methods for scene text image super-resolution (STISR) that consider text attributes. The pioneering work, TSRN (Wang et al. 2020) achieves impressive results by capturing sequential character information. Additionally, recent works attempt to incorporate prior knowledge, such as utilizing text priors from recognizers as clues to guide super-resolution. For instance, TPGSR (Ma, Guo, and Zhang 2023) and TATT (Ma, Liang, and Zhang 2022) use recognition outputs from CRNN (Shi, Bai, and Yao 2016) to guide text reconstruction. TSEPGNet (Huang et al. 2023) extracts text embedding and structure priors from upsampled images as auxiliary information to restore clear text.

Despite the considerable progress made by existing methods, there is still room for exploration in STISR. Firstly, text regions deserve greater focus compared to backgrounds. The quality of text region restoration has a substantial impact on text recognition accuracy. Incomplete recovery of certain text segments can lead to recognition errors, especially when dealing with characters that have similar visual features. As shown in Figure1, TATT (Ma, Liang, and Zhang 2022) fails to entirely restore the content of the text region, such as the letters ‘R’ and ‘E’ in the figure. However, the utilization of text masks encoding location and global structure remains underexplored. Secondly, existing methods suffer from text edge distortion. As discussed in (Saharia et al. 2022) and (Yang et al. 2023), “regression to the mean” can induce text edge distortion in most pixel-loss-based methods. The image restored by TSRN (Wang et al. 2020) can be correctly recognized by recognition models in Figure 1. However, from a human visual perception standpoint, the letter ’M’ exhibits distortion, and phantom contours are apparent around the letters. Fortunately, diffusion-based methods in natural image super-resolution have shown impressive performance in recovering fine details and reconstructing global structure, without encountering issues like mode collapse or training instability that are commonly seen in Generative Adversarial Networks (GANs) (Saharia et al. 2022; Li et al. 2022; Rombach et al. 2022; Ravuri and Vinyals 2019; Gulrajani et al. 2017). However, the diffusion model exhibits generative diversity, making it prone to generating images inconsistent with given conditions, thereby posing challenges in generating a fixed text structure (Yang et al. 2023). Additionally, the multi-step sampling in the inference of diffusion models can result in substantial time consumption.

To address these issues above, we propose TextDiff, a novel framework consisting of two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM comprises two branches. The first branch integrates semantic information into a coarsely deblurred network, enabling it to capture the global coarse textual structure and yield a coarse output. The second branch focuses on explicit text mask learning. This approach leads to the generation of more accurate masks that exhibit high sensitivity towards textual regions. Examples of the generated text masks are illustrated in Figure 2. The MRD module leverages diffusion models to learn the residual distribution between ground-truth images and coarsely deblurred images. Firstly, by predicting the residuals instead of the added noise, we effectively reduce the required number of sampling steps. Secondly, by incorporating text masks and images as input, we provide valuable guidance for residual learning, alleviating text edge distortion and minimizing the generation of diverse outputs. It is worth noting that during inference, we adopt a deterministic sampling scheme to strike a better balance between minimizing distortion and preserving perceptual quality.

Experimental results demonstrate the effectiveness of TextDiff, which shows strong competitiveness with a maximum improvement of 2.3% in recognition accuracy compared to the SOTA method, while achieving competitive visual quality on the TextZoom dataset. More importantly, TextDiff still achieves SOTA performance with only 5 sampling steps. Ablation experiments demonstrate the effectiveness and necessity of each component in TextDiff.

Our contributions are summarized as follows:

•

We introduce TextDiff, which is the first framework in the field of scene text image super-resolution to leverage diffusion models.
•

TextDiff addresses text edge distortions and blurriness, resulting in more natural image restoration and better preservation of text structure consistency between reconstructed and high-resolution (HR) images. Moreover, our plug-and-play MRD module effectively enhances the performance of SOTA methods by sharpening the text edges they generate, without additional joint training.
•

Adequate ablation studies and comparative experiments show that TextDiff achieves SOTA performance on STISR. Additional analysis further confirms the generalizability of TextDiff.

Related Works

Single Image Super-Resolution

Single image super-resolution (SISR) is a fundamental low-level task in computer vision, which goal is to generate HR images from LR inputs. As the seminal work of image super-resolution, SRCNN (Dong et al. 2014) achieves good performance through a simple convolutional network. Besides, in order to further improve the quality of super-resolution (SR) outputs, various methods (Liang et al. 2021; Saharia et al. 2022; Chen et al. 2023) have been proposed. Among them, methods based on the deep generative model, mainly including GAN-based (Wang et al. 2018; Soh et al. 2019; Wang et al. 2021) and diffusion-based methods (Saharia et al. 2022; Whang et al. 2022), have shown convincing image generation ability. However, the method for SISR tasks is not suitable for STISR. The key is that the method of SISR does not consider the structural characteristics of scene text.

Scene Text Image Super-Resolution

Different from SISR, STISR aims to improve the quality of the image while paying attention to the recovery of the text structure. TSRN proposes a real scene text SR dataset, and uses CNN-BiLSTM layers to perceive the sequential information of the text. Following it, (Chen, Li, and Xue 2021) proposes a Transformer-Based Super-Resolution Network (TBSRN) to extract sequential information, designs a Position-Aware Module and a Content-Aware Module to highlight the position and the content of each character. In addition, TPGSR (Ma, Guo, and Zhang 2023) obtains the character probability sequence through a text recognition model and merges it with image feature by convolutions. Different from these, by combining the text mask and graphic recognition results of LR text images, DPMN (Zhu et al. 2023) proposes a plug-and-play module, which can improve the performance of existing models. While these methods have made significant advances, they do not yet fully address issues like text distortion and blurring, which can impair image readlibity. Therefore, our method aims to resolve these specific problems.

Diffusion Models

Diffusion models are latent variable generative frameworks in (Sohl-Dickstein et al. 2015). Recent work has demonstrated the significant potential of diffusion models in SISR. For example, (Li et al. 2022) exploits a Markov chain to convert HR images to latents in simple distribution and then generate SR predictions in the reverse process. However, to the best of our knowledge, diffusion models have not yet been used in STISR. Therefore, in this paper, we explore the performance of diffusion models on STISR for the first time.

Methodology

In this section, we first provide an overview of the proposed scene text image super-resolution network based on the conditional diffusion model. Then we delve into a detailed description of the working mechanism and role of the conditional diffusion model. We will also introduce the training objective of the proposed network.

Overall Architecture

The overall framework of TextDiff is illustrated in Figure 3. First, the LR image is input into the Text Enhancement Module (TEM). The TEM consists of two branches, $\textrm{B}_{T}$ and $\textrm{B}_{M}$ . $\textrm{B}_{T}$ outputs a coarsely deblurred image, and $\textrm{B}_{M}$ learns to predict text masks. Then, guided by the mask, the Mask-Guided Residual Diffusion Module (MRD) learns the distribution of residuals between the ground-truth image and the coarsely deblurred image. Finally, this predicted residual is added to the coarsely deblurred image to obtain the final output.

Text Enhancement Module

As shown in Figure 3, the branch $\textrm{B}_{T}$ is a deblurring network incorporated with semantic information. The front part, called Semantic Block, adopts TP Generator (TPG) from TATT (Ma, Liang, and Zhang 2022) to extract semantic priors, which are then fused with feature maps through TP Interpreter (TPI) in TATT. The addition of this part can effectively integrate the semantic features of the text with the spatial distribution features of the text. The fused results are finally input into the Double Sequential Residual Block (DSRB). DSRB introduces wavelet transform on the basis of SRB (Wang et al. 2020) to realize simultaneous learning of spatial and frequency information. This aims to complement the spatial domain details with frequency domain information to generate an output with richer frequency details.

Additionally, the blurred regions in the scene text image mainly focus on the text. We need a text mask to separate the text from the background, and then use the mask to focus the network on the text area. So the branch $\textrm{B}_{M}$ is for mask prediction, implemented by a conventional convolutional network (see Figure 3). And the ground-truth masks are simply generated by calculating the average gray scale of the RGB images.

We define the Gradient Profile Loss as the pixel loss between the $\textrm{B}_{T}$ output and the ground-truth image to recover the approximate information of the global text structure in the LR image:

\mathcal{L}_{GP}=\mathrm{E}_{x}||\nabla x_{gt}-\nabla x_{sr}||_{1}

(1)

where $\nabla x_{gt}$ denotes the gradient field of HR images, and $\nabla x_{sr}$ denotes that of SR images.

Meanwhile, we define the loss for mask learning as dice loss (Wang et al. 2019). So the $\mathcal{L}_{Mask}$ is used to consider the contour similarity between the $\textrm{B}_{M}$ output $x_{m}$ and the ground-truth text mask $x_{gt_{m}}$ . Dice loss can be calculated as the following Eq.2 and Eq.3:

Dice(P,G)=\frac{2\times{\textstyle\sum_{x,y}}(P_{x,y}\times G_{x,y})}{\sum_{x,y}(P_{x,y})^{2}+\sum_{x,y}(G_{x,y})^{2}}

(2)

\mathcal{L}_{Mask}=1-Dice(P,G)

(3)

$P_{x,y}$ and $G_{x,y}$ represent the pixel value (x, y) of the predicted masks and the ground-truths, respectively.

Mask-Guided Residual Diffusion Module

After the LR image is processed by the TEM, $x_{sr}$ and a text mask are obtained. As shown in Figure 3, $x_{sr}$ has restored the general outline and color of the text and other visual features, but compared with the ground-truth image, the edges of the text are still blurred and partially distorted. The residual $x_{res}$ between the $x_{sr}$ and $x_{gt}$ (i.e., $x_{res}=x_{gt}-x_{sr}$ ) delineates the text outline, as depicted in Figure 3. Thus, we leverage the diffusion model to refine the text contour under the guidance of the text mask to make it more accurate and natural. Experiments show that this residual modeling effectively alleviates the limitations of the regression model in learning text contours, enabling the generation of more perceptually pleasing text.

Our MRD consists of a diffusion process of progressively adding Gaussian noise and a reverse process of learning residual distributions for denoising.

Specifically, the diffusion process starts from the residual image $x_{res}$ , also denoted as $x_{0}$ . Then it repeatedly adds Gaussian noise according to the transition kernel $q(x_{t}\mid x_{t-1})$ . And at the maximum time step t, we obtain $x_{T}$ which is pure Gaussian noise:

q(x_{1},...,x_{T}\mid x_{0}):=\prod_{t=1}^{T}q(x_{t}\mid x_{t-1})

(4)

q(x_{t}\mid x_{t-1}):=\mathcal{N}(x_{t};\alpha_{t}x_{t-1},(1-\alpha_{t})\mathrm{I})

(5)

The noise schedule $\alpha_{t}$ is a pre-chosen hyperparameter that controls the variance of noise added at each step. Setting $\alpha_{t}\in(0,1)$ for all $t=1,...,T$ , $\alpha_{0}=1$ , $\bar{\alpha}_{t}={\textstyle\prod_{i=0}^{t}}\alpha_{i}$ , the diffusion process allows sampling $x_{t}$ at an arbitrary timestep t in closed form:

x_{t}(x_{0},z)=\sqrt{\bar{\alpha}_{t}}x_{t}+\sqrt{1-\bar{\alpha}_{t}}z,z\sim\mathcal{N}(0,\mathrm{I})

(6)

q(x_{t}\mid x_{0})=\mathcal{N}(x_{t};\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})\mathrm{I})

(7)

The reverse process converts noise $x_{T}$ back into data distribution $x_{0}$ conditioned on $x_{sr}$ and $x_{m}$ . We adopt a deterministic manner to conduct the process:

q(x_{t-1}\mid x_{t},x_{0})=\mathcal{N}(x_{t-1};\mu_{t}(x_{t},x_{0}),0)

(8)

where $\mu_{t}(x_{t},x_{0})$ is calculated as:

\mu_{t}(x_{t},x_{0})=\sqrt{\bar{\alpha}_{t-1}}x_{0}+\sqrt{1-\bar{\alpha}_{t-1}}\cdot\frac{x_{t}-\sqrt{\bar{\alpha}_{t}}x_{0}}{\sqrt{1-\bar{\alpha}_{t}}}

(9)

Furthermore, with $x_{0}$ as an unknown parameter, the reverse diffusion step can be implemented by substituting the estimate $f_{\theta}$ in place of $x_{0}$ .

p_{\theta}(x_{t-1}\mid x_{t},x_{sr},x_{m})=q(x_{t-1}\mid x_{t},f_{\theta}(x_{t},t,x_{sr},x_{m}))

(10)

In Eq.10, $x_{m}$ plays an important role. Looking the U-Net structure in Figure 3, we introduced gated dilated convolution (Yang, Xiong, and Wu 2023). When processing the input image depth features and text mask, the gating mechanism learns to generate a gating vector that places more attention on the text regions, while dilated convolution enlarges the receptive field. The combination of both allows effective fusion of multi-scale features and enhances the restoration effect for text structure.

Moreover, $f_{\theta}$ predicts $x_{0}$ instead of noise. The rationale behind this includes: first, cascading both $x_{sr}$ and $x_{m}$ as conditional inputs allows $f_{\theta}$ to emphasize textual regions when predicting $x_{0}$ ; second, predicting noise and $x_{0}$ can be converted via Eq.6; third, predicting $x_{0}$ can better exploit $x_{sr}$ and $x_{m}$ as guiding conditions compared to predicting noise, which only depends on $x_{t}$ . Meanwhile, predicting noise is more likely to cause diversity that compromises textual structure.

In practice, $f_{\theta}$ ensures that the learned conditional distribution $p_{\theta}(x_{t-1}\mid x_{t},x_{sr},x_{m})$ approximates the true reverse diffusion step $q(x_{t-1}\mid x_{t},x_{0})$ as closely as possible. Hence, the training objective is:

\mathcal{L}_{\mathrm{DM}}=\mathrm{E}\parallel x_{0}-f_{\theta}(\sqrt{\bar{\alpha}_{t}}x_{res}+\sqrt{1-\bar{\alpha}_{t}}z,t,x_{sr},x_{m})\parallel_{2}

(11)

Besides, in order to better restore the contour edges of text, we extract textual structural edge information from the $f_{\theta}(x_{t},t,x_{sr},x_{m})$ and $x_{0}$ using a Laplacian kernel $f_{Edge}$ to compute the edge loss. This encourages the model to reconstruct sharper and more coherent textual contours:

\mathcal{L}_{Edge}=\mathrm{E}\parallel f_{Edge}(x_{0})-f_{Edge}(f_{\theta}(x_{t},t,x_{sr},x_{m}))\parallel_{2}

(12)

Training Objective

Algorithm 1 Training

Input: LR image and its corresponding HR image pairs $P=\left\{(x^{k}_{L},x^{k}_{H})\right\}_{k=1}^{K}$ , the HR mask $x_{{gt}_{m}}$
Parameter: total diffusion step $T$ , the predicted mask $x_{m}$ , Text Enhancement Module $\textrm{B}_{T}$ and $\textrm{B}_{M}$ , denoiser $f_{\theta}$ , noise schedule $\alpha_{0:T}$

1: while not converged do

2: Sample

(x_{L},x_{H})\sim P

x_{sr}

\textrm{B}_{T}(x_{L})

, compute

x_{res}

x_{H}-x_{L}

x_{m}

\textrm{B}_{M}(x_{L})

5: Sample

z\sim\mathcal{N}(0,\mathrm{I})

, and

t\sim\mathrm{Uniform}(\{1,...,T\})

6: Take a gradient step on

x_{t}=\sqrt{\bar{\alpha}_{t}}x_{res}+\sqrt{1-\bar{\alpha}_{t}}z

\mathcal{L}_{total}(\textrm{B}_{T},\textrm{B}_{M},f_{\theta};x_{H},x_{sr},x_{m},x_{gt_{m}},x_{res})

7: end while

Algorithm 2 Inference

Input: LR image $x_{L}$ , total diffusion step $T$
Parameter: Text Enhancement Module $\textrm{B}_{T}$ and $\textrm{B}_{M}$ , denoiser $f_{\theta}$ , noise schedule $\alpha_{0:T}$ , the predicted mask $x_{m}$
Output: High-resolution image generated by TextDiff

1: Sample

(x_{T})\sim\mathcal{N}(0,\mathrm{I})

x_{sr}

\textrm{B}_{T}(x_{L})

x_{m}

\textrm{B}_{M}(x_{L})

4: for

t=T:1

x_{res}=f_{\theta}(x_{sr},x_{m},x_{t},t)

x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}x_{res}+\frac{\sqrt{1-\bar{\alpha}_{t-1}}(x_{t}-\sqrt{\bar{\alpha}_{t}}x_{res})}{\sqrt{1-\bar{\alpha}_{t}}}

7: end for

8: return

x_{sr}+x_{0}

as SR prediction

The overall loss function for training the network is comprised of three parts: the loss from the TEM, the loss from the MRD, and a joint loss $\mathcal{L}_{joint}$ combining both modules.

\mathcal{L}_{TEM}=\lambda_{1}\mathcal{L}_{GP}+\lambda_{2}\mathcal{L}_{Mask}+\mathcal{L}_{TP}

(13)

\mathcal{L}_{MRD}=\mathcal{L}_{Edge}+\mathcal{L}_{\mathrm{DM}}

(14)

where $\mathcal{L}_{TP}$ is a text prior loss for TPG (Ma, Liang, and Zhang 2022), $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters. Specifically, the joint loss of the two modules is formulated as:

\mathcal{L}_{joint}=\sum_{i=1}^{N}\mathrm{E}\parallel R_{i}(x_{sr}+f_{\theta}(x_{t},t,x_{sr},x_{m}))-R_{i}(x_{gt})\parallel_{1}

(15)

where $R_{i}$ denotes the feature map output from the CRNN (Shi, Bai, and Yao 2016) after the i-th activation layer. Since CRNN is an optical character recognition model capable of perceiving textual patterns, we utilize its publicly available pre-trained weights to extract features without additional training on TextZoom. Incorporating this CRNN-based loss allows the network to better restore textual content.

Thus, the overall loss function is

\mathcal{L}_{total}=\mathcal{L}_{TEM}+\mathcal{L}_{MRD}+\lambda\mathcal{L}_{joint}

(16)

where $\lambda$ is a hyperparameter.

The training and inference procedures are presented in Algorithm 1 and Algorithm 2, respectively.

Experiments and Results

Datasets and Implementation Details

We conduct training and evaluation on the TextZoom (Wang et al. 2020) which is collected in real-world scenarios. This dataset consists of 17,367 LR-HR image pairs in the training set. The test set is divided into easy, medium, and hard subsets comprising 1,619, 1,411, and 1,343 LR-HR pairs respectively, based on the camera focal length. The size of LR images is $16\times 64$ , while the size of HR images is $32\times 128$ .

Our model is implemented with PyTorch 2.0 deep learning library and all the experiments are conducted on one RTX 4090 GPU. AdamW (Loshchilov and Hutter 2017) is utilized as the optimizer with a learning rate of $1\times 10^{-4}$ , and the batch size is set to 16. The total time steps $T$ are set to 200. The number of training iterations is one million. We use a linear increase in $\beta_{1:T}$ from $1\times 10^{-6}$ to $1\times 10^{-2}$ , $\alpha_{t}=1-\beta_{t}$ . Moreover, the two stages of the proposed network are trained jointly, where the weight coefficient $\lambda$ for the joint loss is set to 5. The weights for $\mathcal{L}_{GP}$ and $\mathcal{L}_{Mask}$ are set to 0.5 and 3 in Eq.13, respectively. Additional model configuration information is given in the supplementary material.

Metrics and Experimental Results

	ASTER(Shi et al. 2018)				MORAN(Luo, Jin, and Sun 2019)				CRNN(Shi, Bai, and Yao 2016)
Method	Easy	Medium	Hard	Average	Easy	Medium	Hard	Average	Easy	Medium	Hard	Average
BICUBIC	67.4%	42.4%	31.2%	48.2%	60.6%	37.9%	30.8%	44.1%	36.4%	21.1%	21.1%	26.8%
SRCNN,TPAMI2015	70.6%	44.0%	31.5%	50.0%	63.9%	40.0%	29.4%	45.6%	41.1%	22.3%	22.0%	29.2%
SRResNet,CVPR2017	69.4%	50.5%	35.7%	53.0%	66.0%	47.1%	33.4%	49.9%	45.2%	32.6%	25.5%	35.1%
TSRN,ECCV2020	73.4%	56.3%	40.1%	57.7%	70.1%	53.3%	37.9%	54.8%	54.8%	41.3%	32.3%	43.6%
TBSRN,CVPR2021	75.7%	59.9%	41.6%	60.0%	74.1%	57.0%	40.8%	58.4%	59.6%	47.1%	35.3%	48.1%
PCAN,MM2021	77.5%	60.7%	43.1%	61.5%	73.7%	57.6%	41.0%	58.5%	59.6%	45.4%	34.8%	47.4%
TPGSR,TIP2023	77.0%	60.9%	42.4%	60.9%	72.2%	57.8%	41.3%	57.8%	61.0%	49.9%	36.7%	49.8%
TPGSR-3,TIP2023	78.9%	62.7%	44.5%	62.8%	74.9%	60.5%	44.1%	60.5%	63.1%	52.0%	38.6%	51.8%
DocDiff,MM2023	69.8%	56.2%	39.8%	56.2%	64.5%	50.5%	35.6%	51.2%	52.5%	41.3%	31.2%	42.4%
TATT,CVPR2022	78.5%	63.3%	45.0%	63.3%	72.9%	61.0%	43.8%	60.1%	64.3%	54.2%	39.1%	53.3%
C3-STISR,IJCAI2022	79.1%	63.3%	46.8%	64.1%	74.2%	61.0%	43.2%	60.5%	65.2%	53.6%	39.8%	53.7%
DPMN(+TATT),AAAI2023	79.3%	64.1%	45.2%	63.9%	73.3%	61.5%	43.9%	60.5%	64.4%	54.2%	39.3%	53.4%
Ours(TextDiff-5)	$79.8\%^{+0.5}$	$65.0\%^{+0.9}$	$47.2\%^{+0.4}$	$65.0\%^{+0.9}$	$76.9\%^{+2.0}$	$61.7\%^{+0.2}$	$44.4\%^{+0.3}$	$62.0\%^{+1.5}$	64.7%	$54.9\%^{+0.7}$	$40.2\%^{+0.4}$	$54.0\%^{+0.3}$
Ours(TextDiff-200)	$80.8\%^{+1.5}$	$66.5\%^{+2.4}$	$48.7\%^{+1.9}$	$66.4\%^{+2.3}$	$77.7\%^{+2.8}$	$62.5\%^{+1.0}$	$44.6\%^{+0.5}$	$62.7\%^{+2.2}$	$64.8\%^{-0.4}$	$55.4\%^{+1.2}$	$39.9\%^{+0.1}$	$54.2\%^{+0.5}$
HR	94.2%	87.7%	76.2%	86.6%	91.2%	85.3%	74.2%	84.1%	76.4%	75.1%	64.6%	72.4%

Table 1: Comparison with state-of-the-art SR methods on three subsets of the TextZoom testsets. The best performance is highlighted in red (1st best) and blue (2nd best). Superscript represents the relative distance of our methods on this indicator compared to the highest value of this indicator among all existing methods. ‘-3’ means multi-stage settings in (Ma, Guo, and Zhang 2023). TextDiff-n means applying n-step sampling(

T

	Metric
Method	PSRN $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	MANIQA $\uparrow$
BICUBIC	20.35	0.6961	0.2199	0.4413
TSRN	21.42	0.7691	0.0973	0.4533
TATT	21.52	0.7930	0.0945	0.4493
C3-STISR	21.60	0.7931	0.0944	0.4521
TextDiff	21.37	0.7841	0.0822	0.4589

Table 2: Evaluation of competitive STISR models on three subsets of the TextZoom testset. The bold numbers denote the best score.

Following the common practice in TSRN, to measure the downstream task performance, we calculate the text recognition accuracy for the text recognition task. Additionally, We evaluate the quality of the SR images using metrics including the no-reference image quality assessment method MANIQA (Yang et al. 2022), the full-reference method LPIPS (Zhang et al. 2018), PSNR and SSIM.

We evaluate our TextDiff and compare it with existing super-resolution models, including SRCNN (Dong et al. 2015), SRResNet (Ledig et al. 2017), TSRN (Wang et al. 2020), TBSRN (Chen, Li, and Xue 2021), PCAN (Zhao et al. 2021), TPGSR (Ma, Guo, and Zhang 2023), DocDiff (Yang et al. 2023), TATT (Ma, Liang, and Zhang 2022), C3-STISR (Zhao et al. 2022) and DPMN (Zhu et al. 2023).

As shown in Table 1, we can clearly observe that our TextDiff achieves a significant improvement in accuracy compared to existing methods. On average, our method improves recognition accuracy by 2.3%, 2.2% and 0.5% on ASTER, MORAN and CRNN, respectively. Furthermore, TextDiff outperforms all existing methods in recognition accuracy with only 5-step sampling (see the quantitative results taking other sampling steps in the supplementary material.). The consistent accuracy gains on multiple models validate the effectiveness of our TextDiff. For comprehensiveness, we also compare TextDiff with SOTA diffusion-based document enhancement method (DocDiff), and the results show the superiority of our method for STISR.

Additionally, as shown in Figure 4, most methods have insufficient capability in recovering textual structures, whereas our method can effectively alleviate this problem. We also give quantitative results for image quality evaluation, as shown in Table 2. Our TextDiff achieves the best MANIQA and LPIPS metrics, while also achieving competitive PSNR and SSIM. Notably, we obtain the LPIPS of 0.0822, a 12% reduction compared to C3-STISR and a 10% reduction compared to TATT.

Finally, we list the failure cases of TextDiff and our future work in the supplementary material.

Ablation Studies

Method	Easy	Medium	Hard	Average
TextDiff w/o $\textrm{B}_{M}$	77.1	60.2	44.3	61.8
TextDiff with NU	79.5	64.0	46.9	64.5
TextDiff with NP	76.6	61.1	43.3	61.4
TextDiff w/o $\mathcal{L}_{joint}$	77.5	61.6	44.2	62.2
TextDiff	80.8	66.5	48.7	66.4

Table 3: Quantitative ablation study results on TextZoom. And the recognition accuracy (%) of different methods are based on ASTER. “w/o” denotes without, “NP” denotes “Noise Prediction”, “NU” denotes an equivalent U-Net is used in place of a diffusion model.

In this section, we perform ablation studies to analyze the contribution of different motivations and model components. All the evaluations are validated on TextZoom. The quantitative and qualitative results of the ablation experiments are provided in Table 3 and in the supplementary material, respectively.

The Mask Branch.

To validate the efficacy of the proposed text mask branch, we ablate it by removing the text mask branch and comparing performance with and without it. Results in Table 3 show that the text mask branch does play a role in improving image quality restoration. The text mask branch enables the network to focus more on the features within the text regions, not only facilitating better extraction of text-level characteristics but also enhancing the connectivity between the two modules. Besides, since the output of the text mask branch does not directly optimize the scene text image, the parameter increase incurred by this branch is not the reason for improved super-resolution performance.

The MRD.

To validate that the performance improvement of our proposed MRD module is not solely due to an increase in parameter size, we replace the MRD module with an identical U-Net structure to form a two-stage regression model. Experimental results show that while simply cascading more encoder-decoder layers can improve recognition accuracy, the perceptual quality is still poor and some fonts are distorted. Cascading the MRD module can improve the mean recognition accuracy by 1.9 $\%$ . This verifies the validity of MRD module.

Noise Prediction and Residual Learning.

We employ the MRD module to predict noise and perform original stochastic sampling. As shown in Table 3, for STISR, predicting residuals from more known conditions achieves better performance than predicting noise. It is shown that prediction residuals are beneficial to text recovery along with deterministic sampling.

Perceptual Loss.

From Table 3, we can see that the perceptual loss can improve accuracy by over 4%. The reason is that the perceptual loss can accurately focus on the difference between text and background, and calculate the loss from shallow textual structure features to deep textual semantic features in the image. In addition, the predicted value input to the loss function is $x_{sr}+x_{0}$ , which can implicitly adjust the fusion effect of pixel-wise addition of TEM and MRD.

Extensions

Method	Easy	Medium	Hard	Average
TSRN	73.4	56.3	40.1	57.7
TSRN + MRD	74.3	59.3	42.4	59.7
TATT	78.5	63.3	45.0	63.3
TATT + MRD	79.0	63.5	46.3	64.0
C3-STISR	79.1	63.3	46.8	64.1
C3-STISR + MRD	79.3	64.2	46.8	64.4
HR	94.2	87.7	76.2	86.6

Table 4: Quantitative results of cascading MRD module with other methods on TextZoom. And the recognition accuracy (%) of different methods are based on ASTER. The bold numbers denote the better score between the baseline and improved method by MRD.

We explore cascading our MRD module with other mainstream approaches to further validate its efficacy and extensibility. Specifically, we achieve joint inference without any joint training, simply by direct inference. The inference results are shown in Table 4 and Figure 1 (see additional cascade inference results in the supplementary material.). The results sufficiently demonstrate that the effectiveness of MRD comes not only from the algorithm itself, but also from its ability to complement other types of methods. For instance, from the cascaded results of MRD with TSRN, it is evident that MRD effectively complements the incomplete font structure obtained from TSRN, leading to results that are more aligned with human perception. As a result, the recognition accuracy of TSRN increases by 2%, without any additional training. Overall, embedding MRD as a post-processing module into existing pipelines can provide further performance boost.

Discussions

We mainly discuss the applicability of PSNR and SSIM metrics for the STISR. As shown in Figure 5, we find that for phrases such as “hat&sturdy”, the PSNR and SSIM of images obtained by using existing methods are relatively high, but the recovered fonts are distorted or blurred. In contrast, the image recovered by TextDiff can perfectly restore the font structure, but the PSNR and SSIM are low. Whether it is from the perspective of human perception or model recognition results, our method is relatively good. This also confirms the quantitative results in Table 2. In summary, as discussed in (Wang et al. 2020) and (Yang et al. 2023), it can be concluded that PSNR and SSIM metrics only partially align with human perception when applied to text images.

Conclusions

In this paper, we propose TextDiff, a novel framework for scene text image super-resolution. The framework comprises two key components: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM includes two branches, one branch realizes the preliminary deblurring of images combined with semantic information, and the other branch realizes the learning of text masks. The MRD module learns the residual distribution under the guidance of the text mask, and further restores the text structure and edge contour. Extensive experiments demonstrate that compared to existing methods, our approach achieves state-of-the-art (SOTA) performance on public benchmark datasets. In addition, our MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods. We believe our work will provide valuable intuition for further improvement of the STISR task.

References

Chen, Li, and Xue (2021) Chen, J.; Li, B.; and Xue, X. 2021. Scene text telescope: Text-focused scene image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12026–12035.
Chen et al. (2023) Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; and Dong, C. 2023. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22367–22377.
Dong et al. (2014) Dong, C.; Loy, C. C.; He, K.; and Tang, X. 2014. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, 184–199. Springer.
Dong et al. (2015) Dong, C.; Loy, C. C.; He, K.; and Tang, X. 2015. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2): 295–307.
Gulrajani et al. (2017) Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of wasserstein gans. Advances in neural information processing systems, 30.
Huang et al. (2023) Huang, C.; Peng, X.; Liu, D.; and Lu, Y. 2023. Text Image Super-Resolution Guided by Text Structure and Embedding Priors. ACM Transactions on Multimedia Computing, Communications and Applications.
Ledig et al. (2017) Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4681–4690.
Li et al. (2022) Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; and Chen, Y. 2022. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479: 47–59.
Liang et al. (2021) Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; and Timofte, R. 2021. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, 1833–1844.
Long, He, and Yao (2021) Long, S.; He, X.; and Yao, C. 2021. Scene text detection and recognition: The deep learning era. International Journal of Computer Vision, 129: 161–184.
Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Luo, Jin, and Sun (2019) Luo, C.; Jin, L.; and Sun, Z. 2019. Moran: A multi-object rectified attention network for scene text recognition. Pattern Recognition, 90: 109–118.
Ma, Gong, and Yu (2022) Ma, H.; Gong, B.; and Yu, Y. 2022. Structure-aware Meta-fusion for Image Super-resolution. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 18(2): 1–25.
Ma, Guo, and Zhang (2023) Ma, J.; Guo, S.; and Zhang, L. 2023. Text prior guided scene text image super-resolution. IEEE Transactions on Image Processing, 32: 1341–1353.
Ma, Liang, and Zhang (2022) Ma, J.; Liang, Z.; and Zhang, L. 2022. A text attention network for spatial deformation robust scene text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5911–5920.
Ravuri and Vinyals (2019) Ravuri, S.; and Vinyals, O. 2019. Classification accuracy score for conditional generative models. Advances in neural information processing systems, 32.
Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695.
Saharia et al. (2022) Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D. J.; and Norouzi, M. 2022. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4): 4713–4726.
Shi, Bai, and Yao (2016) Shi, B.; Bai, X.; and Yao, C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11): 2298–2304.
Shi et al. (2018) Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2018. Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2035–2048.
Soh et al. (2019) Soh, J. W.; Park, G. Y.; Jo, J.; and Cho, N. I. 2019. Natural and realistic single image super-resolution with explicit natural manifold discrimination. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8122–8131.
Sohl-Dickstein et al. (2015) Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, 2256–2265. PMLR.
Wang et al. (2019) Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; and Shao, S. 2019. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9336–9345.
Wang et al. (2020) Wang, W.; Xie, E.; Liu, X.; Wang, W.; Liang, D.; Shen, C.; and Bai, X. 2020. Scene text image super-resolution in the wild. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, 650–666. Springer.
Wang et al. (2021) Wang, X.; Xie, L.; Dong, C.; and Shan, Y. 2021. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, 1905–1914.
Wang et al. (2018) Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; and Change Loy, C. 2018. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, 0–0.
Whang et al. (2022) Whang, J.; Delbracio, M.; Talebi, H.; Saharia, C.; Dimakis, A. G.; and Milanfar, P. 2022. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16293–16303.
Yang et al. (2022) Yang, S.; Wu, T.; Shi, S.; Lao, S.; Gong, Y.; Cao, M.; Wang, J.; and Yang, Y. 2022. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1191–1200.
Yang et al. (2023) Yang, Z.; Liu, B.; Xiong, Y.; Yi, L.; Wu, G.; Tang, X.; Liu, Z.; Zhou, J.; and Zhang, X. 2023. DocDiff: Document Enhancement via Residual Diffusion Models. arXiv preprint arXiv:2305.03892.
Yang, Xiong, and Wu (2023) Yang, Z.; Xiong, Y.; and Wu, G. 2023. GDB: Gated convolutions-based Document Binarization. arXiv preprint arXiv:2302.02073.
Zhang et al. (2018) Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 586–595. IEEE Computer Society.
Zhao et al. (2021) Zhao, C.; Feng, S.; Zhao, B. N.; Ding, Z.; Wu, J.; Shen, F.; and Shen, H. T. 2021. Scene text image super-resolution via parallelly contextual attention network. In Proceedings of the 29th ACM International Conference on Multimedia, 2908–2917.
Zhao et al. (2022) Zhao, M.; Wang, M.; Bai, F.; Li, B.; Wang, J.; and Zhou, S. 2022. C3-stisr: Scene text image super-resolution with triple clues. arXiv preprint arXiv:2204.14044.
Zhu et al. (2023) Zhu, S.; Zhao, Z.; Fang, P.; and Xue, H. 2023. Improving Scene Text Image Super-Resolution via Dual Prior Modulation Network. arXiv preprint arXiv:2302.10414.

Appendix A Model Configuration

We set the initial number of Channels in U-net to 64, with channel multipliers 1, 2, 4, 4. The Residual Down Block and Residual Up Block are structurally consistent. They are composed of convolutional layers, activation layers, dropout layers, self-attention layer, and normalization layers. The only difference is the number of channels. The dropout is set to 0.1.

Appendix B More Details In Ablation Study

In Figure 6, we present the qualitative results of several ablation experiments. We can draw the following analysis conclusions:

•

Adding encoder-decoder layers to optimize pixel loss in a cascade manner may not necessarily improve legibility.
•

Predicting the added noise while sampling too few steps may result in a sampled image with noise and blurred text edges.
•

If the text mask information is not used, the edges of the text in the generated image will be more blurred, which reflects that the text mask has a positive effect on text restoration.
•

Perceptual loss plays an important role in the recovery of text structure.

Appendix C Discussion About Sampling Steps

Method	Easy	Medium	Hard	Average
TextDiff-5	79.8	65.0	47.2	65.0
TextDiff-20	79.9	65.3	47.3	65.2
TextDiff-50	80.0	65.3	47.5	65.3
TextDiff-110	80.2	65.5	47.7	65.5
TextDiff-200	80.8	66.5	48.7	66.4

Table 5: Quantitative results for different sampling steps on TextZoom. TextDiff-n means applying n-step sampling (

T

We set the number of sampling steps to 200 in the training process. In order to further explore the influence of different sampling steps on STISR, we conduct experiments with different sampling steps, and the number of sampling steps is set to 5, 20, 50, 110 and 200 respectively. The quantitative results of different sampling steps are shown in Table 5. As expected, as we increase the sampling step, the recognition accuracy is gradually increasing. We emphasize that TextDiff is able to produce high-quality images within a few steps, thanks to its training strategy of predicting original data and its deterministic sampling strategy.

Appendix D More Details In Cascading Inference

The specific implementation of the cascade operation is to first input the LR image into the existing method for processing, and then input the obtained image into the Mask-Guided Residual Diffusion Module (MRD) we proposed. In particular, since joint training is not required, in the experiments, for existing methods, we directly use models from the official implementation or perform the identical hyper-parameters as reported in the official implementations to train the baseline models. Furthermore, we present some additional qualitative results of cascade inferences in Figure 7. From these figures, we can intuitively feel the strong recovery ability and robustness of our proposed MRD module.

Appendix E Failure Case and Limitation

Figure 8 shows some failure cases. For some LR images, due to the similarity between characters, there is a problem that some letters are restored to other characters. This problem appears to be less likely to occur compared to existing methods. However, although the corresponding strategy (e.g., residual learning) is used in our proposed TextDiff to solve this problem, it is not enough. Therefore, we leave it as future work.

Appendix F Future Work

In our work, there are still problems to be solved. As in the experiment, it is found that character substitution still exists in scene text recovery, which is also an inherent problem in existing methods. In TextDiff, the potential solutions are to enhance the text positioning ability, optimize the conditional input method of the diffusion model, increase the diversity of samples, etc.