TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision

Yukun Zhai¹, Xiaoqiang Zhang², Xiameng Qin², Sanyuan Zhao¹, Xingping Dong¹, Jianbing Shen¹
¹School of Computer Science, Beijing Institute of Technology
²Department of Computer Vision Technology, Baidu Inc.
{zhaiyukun, zhaosanyuan, shenjianbing}@bit.edu.cn
{xingping.dong}@gmail.com, {zhangxiaoqiang01, qinxiameng}@baidu.com

Abstract

End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods heavily rely on Region-of-Interest (RoI) operations to extract local features and complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. Specifically, using query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an Adaptive Global aGgregation (AGG) module to transfer global features into sequential features for reading arbitrarily-shaped texts, which overcomes the sub-optimization problem of RoI operations. Furthermore, potential corpus information is utilized from weak annotations to full labels through mixed supervision, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. Especially on TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method in terms of 1-NED by 13.2%.

1 Introduction

Scene text spotting, which aims to locate and read text from natural images, has received widespread academic and industrial attention due to its numerous real-world application values, including image retrieval [10], fake news detection [41], augmented reality [56], blind navigation [44], visual question answering [1], document understanding [22, 58, 59], etc. Traditionally, a scene text spotter consists of a text detector and a recognizer. The detector identifies the position of text, represented as polygon coordinates or a shape mask, and the recognizer transcribes the corresponding text. Unlike isolation-based training for text detection and recognition tasks [17, 13, 35], end-to-end methods [31, 27, 55, 39, 37, 38, 51, 57, 52] that train two tasks within a unified framework have received increasing interest due to their benefits of a unified framework and global optimization. However, there is still room for improvement due to the inherent challenges of end-to-end text spotting, such as arbitrarily-shaped texts with large variations in font, style, and size and ambiguous texts with juxtaposed text lines [53].

Refer to caption — Figure 1: Comparison of previous end-to-end text spotting methods with our query-based pipeline. (a) shows the traditional top-down (two-stage) framework, which uses RoI operation to extract features and integrate the detection and recognition branches. (b) depicts the bottom-up framework (one-stage), which predicts all possible characters and requires intricate post-processing step to match all characters into the correct words. (c) is our query-based framework that directly outputs both segmentation and recognition results without RoI operation and post-processing step.

Early end-to-end methods [27, 55, 39, 38, 51, 26] are designed using a top-down approach, following a two-stage paradigm that first detects text areas and then recognizes corresponding text using Region-of-Interest (RoI) operation, as depicted in Fig. 1 (a). Despite their promising performance, these methods only share shallow backbone features and adopt local RoI features for different tasks, which can lead to two significant problems: (1) The recognition performance is heavily reliant on the accuracy of detection results; (2) The scope of recognition learning is limited to enhancing detection features. As illustrated in Fig. 1 (b), recent bottom-up methods [57, 37, 52] strive to break free from the top-down paradigm by involving character detection and classification, thus reducing the strong dependence between text detection and recognition. Yet these methods are plagued by inaccurate character detection results and require time-costly character-level annotation.

People naturally focus on the general area while reading the text and then adjust their focus to find the correct content [43]. Recognizing this pattern and the limitations of the above frameworks, we propose there is no strong coupling between text detection and recognition tasks, and they can be mutually promoted by sharing deeper features through a multi-task model design. And inspired by recent query-based methods [5, 62, 7] in object detection and instance segmentation, we introduce TextFormer, a query-based end-to-end text spotter based on Transformer architecture. It comprises an image encoder, a text decoder, and a multi-task module containing classification, segmentation, and recognition branches. For each text query, TextFormer predicts its category, segmentation mask, and corresponding text transcription in parallel. The multi-task module can be collaboratively trained and optimized to share complementary information by learning shared semantic features from the text decoder. In addition, an Adaptive Global feature aGgregation (AGG) module is employed to extract features of different orientations for text recognition, enabling our network to read arbitrarily-shaped text.

However, achieving adequate joint training is not straightforward: (1) Full annotations (Fig. 2 (a)) to support end-to-end text spotting are overly expensive; (2) The recognition branch requires far more training data than the detection branch, especially for reading Chinese text in the wild. This is because the number of Chinese characters is much larger than that of Latin-based languages. Previous methods solve this problem by replacing time-costly bounding box or polygon annotations with single-point annotations [36] or even retaining only text annotations [47] (Fig. 2 (b)), while end-to-end text spotting performance is largely behind compared to the same text spotter trained with full annotations.

To address this issue, we present mixed supervision for training an end-to-end text spotter. It utilizes a mixture of weak annotations (Fig. 2 (c)) and full annotations, which can meet the needs of data quantity for text recognition and further boost the performance of end-to-end text spotting. And in order to allow this training setting, we simply modify the original composition of Hungarian matching loss [5] without changing the network structure. With the generic design of our framework and mixed-supervision training strategy, our TextFormer achieves the state-of-the-art performance on common benchmarks.

In summary, our main contributions can be summarized as follows:

•

We design a novel, fully end-to-end text spotter with a multi-task model design that uses text queries to bridge the classification, segmentation, and recognition branches together. It also allows a global feature extractor named AGG to extract features from different orientations for reading arbitrarily-shaped texts.
•

We train our network with mixed supervision to improve the effect of co-optimization for text detection and recognition, which utilizes a mixture of weak annotation and full labels. To our knowledge, we are the first to adopt mixed supervision to tackle the end-to-end text spotting task.
•

Extensive experiments on public benchmarks demonstrate the state-of-the-art performance of our method on both text detection and end-to-end text spotting tasks. In particular, we evaluated TextFormer on an ambiguous text spotting dataset named TDA-ReCTS, where it outperforms its counterparts by a large margin of 13.2% in terms of 1-NED.

2 Related Work

2.1 End-to-end Text Spotting

We compare two paradigms for typical end-to-end text spotting frameworks, top-down and bottom-up methods. The top-down approach employs RoI operations to link the text detector and recognizer. In contrast, the bottom-up approach utilizes character as the basic unit to perform text detection and recognition, eliminating the need for the RoI step.

2.1.1 Top-down Methods

The initial approaches aim to read regular texts via a top-down framework. The first end-to-end trainable text spotter [21] was inspired by Faster R-CNN [42] and used RoI Pooling to unify the detection and recognition parts into a unified network. An advanced version [4] was designed to handle multi-oriented texts. To further integrate the detection and recognition stages, FOTS [26] introduced RoI-Rotate to extract the text proposal features corresponding to the detection results. Meanwhile, He et al. [16] developed a similar framework that employed an attention-based sequence decoder as its recognition head. Despite the excellent performance of these methods on regular texts, these methods were ineffective at processing arbitrarily-shaped texts.

Based on Mask R-CNN [15], Mask TextSpotter [31] added a character-level segmentation branch to read the text of arbitrary shapes. Qin et al. [39] directly adopted Mask R-CNN and introduced RoI Masking to crop features with segmentation masks. And PAN++ [55] proposed an optimized feature extractor named Masked RoI to better remove noise from the background. TextDragon [12] described the shape of text with a series of quadrangles and used RoISlide to sample features. Boundary [51] and Text Perceptron [38] regraded the text boundaries as key points and applied the thin-plate-spline transformation [2] to rectify irregular text instances. ABCNet [27] and ABCNet v2 [28] fitted arbitrarily-shaped texts by parameterized Bezier curves and proposed BezierAlign as the feature extractor.

Most of the methods mentioned above use RoI operations to connect the text detector and the recognizer in a linear process of detection followed by recognition. As a result, the recognition performance is depended on the accuracy of detection results and RoI operations, especially for ambiguous texts [53].

2.1.2 Bottom-up Methods

Recently, some works have attempted to overcome the limitations of RoI operations for top-down methods. CharNet [57] directly segmented single character instance, which requires character-level annotations for training. And the final text instance can be generated by post-processing steps to group characters into words. MANGO [37] exploited the position-ware mask attention module to predict characters in each text instance, which utilized the sequence context information among individual characters. To avoid using character-level annotations, PGNet [52] learned the pixel-level character classification map by point-gathering operation, while the text recognition module was still dependent on the detection part. The above approaches alleviate the reliance between detection and recognition breaches. However, character-level labels or complex post-processing steps are still required for training and inference.

2.1.3 Query-based Methods

DETR [5] has introduced a new paradigm in object detection and image segmentation, eliminating the need for RoI operators and complex post-processing procedures. Instead, object queries are used to aggregate features and then generate the detection and segmentation results for each object instance using the Transformer encoder-decoder architecture. Researchers have attempted to lead this paradigm to the scene text detection task with tailored designs. Raisi et al. [40] was the first to introduce the Transformer framework for scene text detection, treating text instances as object queries to handle multi-oriented text. Jingqun et al. [48] then proposed a novel encoder-decoder architecture that performed text detection based on a few sampled point features represented by feature queries.

Our TextFormer is a scene text spotter based on the query design. Specifically, it leverages text queries to represent all possible text in the scene image and character queries to represent the characters within each text instance.

2.2 Mixed Supervision

Previous works have focused on using varying levels of annotated data in computer vision tasks, such as object detection [3], segmentation [33], and scene text detection [49], etc. SPTS [36] used single-point labels to indicate text locations, while TOSS [47] utilized a fully transcription-based supervised approach that trained the text spotter with text annotations. Despite using different kinds of annotated data, the model performance is largely behind the common text spotter. Instead, we use a mix of weak annotations with full annotations to leverage the potential of the query-based text spotter and further enhances the performance of our model, which is formulated as mixed supervision. To the best of our knowledge, we are the first to adopt mixed supervision to tackle the end-to-end text spotting task.

3 Methodology

Our TextFormer consists of four main components: (1) an image encoder-decoder to extract semantic features and share them with the multi-task branches, (2) a classification branch to classify text queries, (3) a segmentation branch to segment text areas, and (4) a recognition branch to read text. In this section, we first present the framework of TextFormer and then introduce our mixed supervision tailored for end-to-end text spotting.

3.1 Overall Architecture

The overall architecture of TextFormer is illustrated in Fig. 2. The encoder blocks perform self-attention across multi-scale feature maps. Then, text queries are fed through the decoder blocks to generate semantic features for the parallel heads, which predict masks and corresponding characters to each text instance. We then specify the implementations of each module in the sub-sections below.

3.2 Encoder-Decoder Module

Given a natural scene image $X\in\mathbb{R}^{H\times W\times 3}$ , the backbone with Feature Pyramid Network (FPN) [24] extracts feature maps $P_{5}$ , $P_{4}$ , $P_{3}$ , $P_{2}$ . The resolution of multi-scale feature maps correspond to $\frac{1}{32},\frac{1}{16},\frac{1}{8},\frac{1}{4}$ comparing to original image size with 256 channels respectively. We then flatten and concatenate the first three features into feature sequence with size of $(L_{5}+L_{4}+L_{3})\times 256$ , where $L_{i}$ denotes the flattened length of $P_{i}$ , which is $\frac{H}{2^{i}}\times\frac{W}{2^{i}}$ . Added with 2D Positional Embedding [5], the feature sequence is fed into Transformer encoder blocks to generate refined features. We equip each encoder layer with deformable attention [62] that constraints cross-attention at a fixed number of reference points, which can largely reduce computational overheads and better capture small text instances.

We take $N$ randomly initialized embeddings as text queries, standing for each possible text instance in the scene image. Next, using text queries and refined features as inputs, the Transformer decoder blocks output $N$ text embeddings $\varepsilon_{text}\in\mathbb{R}^{N\times 256}$ . To speed up the convergence of the training process, we replace the standard Transformer decoder with masked-attention decoder [7], which limits cross attention to within the predicted region for each text query.

After getting text embeddings $\varepsilon_{text}$ from decoder blocks, we compute the shared semantic features $S$ for multi-task heads. Concretely, we first upsample the $P_{2}$ feature to generate pixel embedding $\varepsilon_{pixel}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times 256}$ . Then we obtain each semantic feature $S_{i}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times 256}$ via a dot product between text embedding $\varepsilon_{text}$ and pixel embedding $\epsilon_{pixel}$ . The shared semantic features contain both global and local information, which are devoted to mutually optimizing the network.

3.3 Recognition Branch

The detailed structure of our recognition branch is shown in Fig. 4. It is composed of a global feature extractor and an attention-based recognizer.

3.3.1 AGG Module

The AGG module serves as a global feature extractor, which is used to obtain sequential features from the shared semantic features. Our approach is inspired by the observed linearity of the line-by-line reading order, and as such, we aggregate the global feature map in both horizontal and vertical orientations. These aggregated features are then concatenated and used for scene text recognition.

Firstly, we employ a $1\times 1$ convolution layer with a sigmoid function to compute the attention mask $M_{i}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times 256}$ of each text query from the shared semantic feature $S_{i}$ , which is described as:

\displaystyle M_{i}=sigmoid(Conv(S_{i}))

(1)

Secondly, as shown in Fig. 4, we multiply the shared semantic feature $S_{i}$ by the attention mask $M_{i}$ to pay more attention to the text instance and decrease the effect of background noise. Then, we adaptive aggregating the obtained global feature map under the guidance of the attention mask $M_{i}$ in the corresponding column to obtain the horizontal feature $F_{h}$ . The vertical feature $F_{v}$ can be extracted in the same way. Thus, adaptive global feature aggregation could be formulated in two directions as:

	$\displaystyle F_{h}=AGG_{h}(S_{i}\odot W_{i})$		(2)
	$\displaystyle F_{v}=AGG_{v}(S_{i}\odot W_{i})$		(2)

where $\odot$ denotes element-wise multiplication. $AGG_{(\cdot)}(\cdot)$ represents an adaptive summation operation in the corresponding direction, as implemented in our model. Specifically, we compute the sum of the feature map along a certain direction and divide it by the sum of the mask attention map in the same direction. As such, these two directional vectors contain information pertaining to the character sequence in the different orientations.

Finally, we obtain the sequence vector fed into the recognition head as:

\displaystyle x_{seq}=concat\left[F_{h}\oplus e_{h},F_{v}^{T}\oplus e_{v}\right]\oplus e_{d}

(3)

In addition to the 1D sine positional embedding added to $F_{h}$ and $F_{v}$ , we add a direction embedding, denoted as $e_{d}$ , to the concatenated feature sequence, which allows the following recognition head to identify the direction in each feature sequence lies. The direction embedding is randomly initialized and can be jointly trained in our network.

The proposed AGG module is completely differentiable, enabling the recognition loss to be propagated to the segmentation branch, thereby allowing mutual training and optimization of both branches in an end-to-end manner. Moreover, the AGG module extracts features directly from the global feature maps without the need for any spatial rectification operation.

3.3.2 Recognition Head

Since the input sequence vector $x_{seq}$ is already rich in information after being encoded by the AGG module, we simply use the standard Transformer-based decoder as our recognition head. The initial character queries are randomly initialized, similar to the above-mentioned text queries. With the output sequence of the Transformer decoder, we can transform the recognition task into a parallel character classification problem. Concretely, we employ a fully-connected (FC) layer to get the final predicted character sequence.

\displaystyle x_{rec}=FC(Decoder(x_{seq}))

(4)

where $x_{rec}\in\mathbb{R}^{K\times C}$ . $K$ is the number of character queries and $C$ is the number of character classes. During training, the ground truth text is padded with $\left[\text{PAD}\right]$ symbols until reaches the pre-defined length $K$ .

3.4 Classification Branch

The classification branch aims to classify the text queries from semantic features. We use global average pooling and three consecutive FC layers with a softmax activation function to predict class probabilities which include text, background, and no-text classes.

3.5 Segmentation Branch

The segmentation head of our model is designed using a stack of $3\times 3$ convolution layers, followed by a $1\times 1$ convolution layer that incorporates a sigmoid function to obtain the final text instance mask. To avoid overlap, each pixel is assigned to the instance with the highest mask probability by merging all of these segmentation masks. Ultimately, the polygons of each instance can be generated by applying connected component analysis to the corresponding text mask.

3.6 Mixed Supervision

Our TextFormer benefits from the parallel design, as illustrated in Fig. 3, which allows for simultaneous multi-task learning based on the shared semantic features. However, while detecting text position is relatively straightforward, training text recognition requires a significantly larger amount of data and training rounds [39]. To improve the co-optimization between text detection and recognition, it is important to make the best use of the different annotated labels. To address this problem, we propose mixed supervision to reduce the inconsistency between text detection and recognition. During this training paradigm, hybrid data with various annotations are fed into our network, as shown in Fig. 2. Full annotations involve both text instance masks and corresponding transcriptions. Text annotations provide instance-level transcriptions, while weak annotations provide a single transcription of the text of interest [45].

3.6.1 Matching Cost

Assuming that the ground-truth of text instance is represented as a triplet set $(C_{q},M_{q},T_{q})$ which indicates query class, query mask, and text transcription. For convenience, we describe the text transcription as $T_{q}=\left\{t^{k}_{q}\right\}^{K}_{k=1}$ , where $K$ is the length of character sequence. The predicted result $\hat{y}$ and the label $y$ are formulated as:

		$\displaystyle\hat{y}=\left\{\hat{y}_{j}=(\hat{C}_{j},\hat{M}_{j},\hat{T}_{j})\right\}_{j=1}^{N}$		(5)
		$\displaystyle y=\left\{\left\{y_{q}=(C_{q},M_{q},T_{q})\right\}_{q=1}^{N_{gt}},\left\{\phi,\cdots,\phi\right\}^{N-N_{gt}}\right\}$		(5)

where $N$ is a pre-defined number of text predictions in our method, which equals to the number of text queries. Moreover, we append several $\phi$ tags (meaning “no text”) to pad the the number of text instance labels in ground truth to $N$ .

In order to train our network with different supervisions, we employ a bipartite matching strategy [5] to compute the one-to-one matching cost. The matching cost contains three components: classification cost, mask cost, and recognition cost. Suppose that $\{y_{q},\hat{y}_{j}\}$ is the candidate of matched pair, the matching cost can be described as:

	$\displaystyle\mathcal{L}_{match}(y_{q},\hat{y}_{j})=\{\mathcal{C}_{mask}(M_{q},\hat{M}_{j})-\mathcal{C}_{rec}(T_{q},\hat{T}_{j})$		(6)
	$\displaystyle-\mathcal{C}_{cls}(C_{q})\}_{\{C_{q}\neq\phi\}}$		(6)

where $\mathcal{C}_{cls}$ is the classification cost that equals the predicted probability of query class. $\mathcal{C}_{mask}$ indicates the mask cost which is the similarity of the predicted and ground truth masks. The recognition cost $\mathcal{C}_{rec}$ calculates the average probability of predicted text sequence with ground truth, which is represented as:

\displaystyle\mathcal{C}_{rec}(T_{q},\hat{T}_{j})=\sum_{k=1}^{K}\hat{t}_{j}^{k}(t_{q}^{k})

(7)

At last, we obtain the best one-to-one matching pairs by minimizing the total matching cost $\mathcal{L}_{match}$ , which is expressed as:

\displaystyle\hat{\sigma}=\underset{\sigma\in\mathfrak{S}_{N}}{argmin}\sum_{q}^{N}\mathcal{L}_{match}(y_{q},\hat{y}_{j})

(8)

where $\mathfrak{S}_{N}$ are the permutations sets for matching.

Table 1: End-to-end text spotting results on ICDAR 2015. “S”, “W”, “G” mean text recognition with Strong (S), Weak (W), and Generic (G) lexicon, respectively. The prefix “S-” and “L-” indicate rescaling the input image by the shorter or longer side. MANGO* [37] is evaluated with IOU 0.1.

Method	Input Size	Detection			End-to-end
Method	Input Size	Precision	Recall	F-measure	S	W	G
FOTS [26]	L-2240	91.0	85.2	88.0	81.1	75.9	60.8
TextNet [46]	-	89.4	85.4	87.4	78.7	74.9	60.5
He et al. [16]	-	87.0	86.0	87.0	82.0	77.0	63.0
Mask TextSpotter [31]	S-1600	91.6	81.0	86.0	79.3	73.0	62.4
TextDragon [12]	-	92.5	83.8	87.9	82.5	78.3	65.1
Unconstrained [39]	S-900	89.4	85.8	87.5	83.4	79.9	68.0
Text Perceptron [38]	L-2000	92.3	82.5	87.1	80.5	76.6	65.1
Mask TextSpotter v3 [23]	S-1440	-	-	-	83.3	78.1	74.2
PAN++ [55]	S-896	91.4	83.9	87.5	82.7	78.2	69.2
ABCNet v2 [28]	S-1000	90.4	86.0	88.1	82.7	78.5	73.0
Boundary TextSpotter [30]	S-1080	88.7	84.6	86.6	82.5	77.4	71.7
ABINet++ [11]	S-1000	-	-	-	84.1	80.4	75.4
CharNet R-50 [57]	-	91.2	88.3	89.7	80.1	74.5	62.2
PGNet [52]	L-1536	91.8	84.8	88.2	83.3	78.3	63.5
MANGO* [37]	L-1800	-	-	-	81.8	78.9	67.3
TextFormer	S-1000	94.3	89.2	91.7	84.5	80.9	76.0

3.6.2 Loss Function

For the segmentation task, we use a combination of a dice loss [32] and a focal loss [25]. For the classification and recognition tasks, we simply use the cross-entropy loss for each predicted text instance. Then, the overall training loss is as follows,

\mathcal{L}=\mathcal{L}_{cls}+\lambda_{mask}(\mathcal{L}_{focal}+\mathcal{L}_{dice})+\lambda_{rec}\mathcal{L}_{rec}

(9)

where $\lambda_{mask}$ and $\lambda_{rec}$ are the weights of segmentation and transcription losses.

Training our model with text annotations or weak labels, the matching cost is based solely on the classification and recognition cost. And the loss terms are composed of the classification loss and the recognition loss. There is a slight difference when training with weak labels that we only consider the classification loss between the matched query and the ground truth (1 to 1). We do this to avoid text queries from predicting only single text in a text image.

Table 2: Text spotting results on Total-Text. “None” represents lexicon-free and “Full” means lexicons of all images. MANGO* [37] is evaluated with IOU 0.1.

Method	Input Size	Detection			End-to-end
Method	Input Size	Precision	Recall	F-measure	None	Full
Mask TextSpotter [31]	S-1000	69.0	55.0	61.3	52.9	71.8
TextNet [46]	-	68.2	59.5	63.5	54.0	-
TextDragon [12]	-	85.6	75.5	80.3	48.8	74.8
ABCNet [27]	-	-	-	-	64.2	75.7
ABCNet-MS [27]	-	-	-	-	69.5	78.4
Unconstrained [39]	S-600	83.3	83.4	83.3	67.8	-
Text Perceptron [38]	L-2000	87.5	81.9	84.6	57.0	76.6
Mask TextSpotter v3 [23]	-	-	-	-	71.2	78.4
PAN++ [55]	S-736	-	-	-	68.6	78.6
ABCNet V2 [28]	S-1000	90.2	84.1	87.0	70.4	78.1
Boundary TextSpotter [30]	S-800	89.6	81.2	85.2	66.2	78.4
ABINet++ [11]	S-1000	-	-	-	77.6	84.5
CharNet [57]	-	89.9	81.7	85.6	66.0	-
PGNet [52]	L-640	86.8	85.5	86.1	63.1	-
MAGNO* [37]	L-1280	-	-	-	71.7	82.6
TextFormer	S-1000	89.3	85.0	87.1	77.9	84.9

4 Experiments

To evaluate the effectiveness and robustness of our proposed TextFormer, we conduct experiments on several challenging scene text benchmarks, including multi-oriented text dataset ICDAR 2015 [18], arbitrarily-shaped text dataset Total-Text [8], two bilingual text benchmarks ReCTS [60] and LSVT [45], and ambiguous text dataset TDA-ReCTS [53]. Ablation studies are preformed on ICDAR2015 and TDA-ReCTS to verify the legitimacy of our recognition branch and mixed supervision.

Table 3: Text spotting results on TDA-ReCTS and ReCTS benchmarks. “P”, “R”, “F”, and “1-NED” indicate precision, recall, F-measure, and normalized edit distance, respectively. Method marked with * means the model is trained with an extra weakly annotated dataset LSVT (only a single text-of-interest transcription is provided per image without any location annotation).

Method	TDA-ReCTS				ReCTS
Method	Precision	Recall	F-measure	1-NED	Precision	Recall	F-measure	1-NED
EAST [61]	70.6	61.6	65.8	-	74.3	73.7	74.0	-
PSENet [54]	72.1	65.5	68.7	-	87.3	83.9	85.6	-
FOTS [26]	68.2	70.0	69.1	34.9	-	-	-	50.8
Mask TextSpotter [31]	80.1	74.9	77.4	46.7	89.3	88.8	89.0	67.8
ABCNet v2 [28]	-	-	-	-	93.6	87.5	90.4	62.7
AE TextSpotter [53]	84.8	78.3	81.4	51.3	92.3	91.5	91.9	73.1
TextFormer	84.6	82.0	83.3	63.1	94.2	89.8	91.9	77.7
TextFormer*	84.6	82.7	83.6	64.5	94.4	90.0	92.2	78.8

4.1 Datasets

SynthText 150k [27] is a synthesized dataset for arbitrarily-shaped scene text. It contains 150k synthetic text images with polygon annotations, which are composed of one-third of curved text, and the rest are multi-oriented text.

SynChinese 130k [28] is a synthesized dataset for bilingual scene text (English and Chinese) with a multi-oriented arrangement. It includes 130k synthetic images with quadrilateral annotations.

ICDAR 2015 [18] consists of 1,000 training images and 500 testing images that are incidentally captured in the scene street. It contains primarily multi-oriented and small text, which are hard to detect and recognize.

Total-Text [8] is an arbitrary-shaped dataset with horizontal, multi-oriented, and curved text. It has 1,255 images for training and 300 images for testing, annotated with the polygonal bounding box at the word level. Each image has at least one curved text.

ReCTS [60] is a Chinese multi-orientation natural scene text dataset. It contains 25k signboard images divided into 20k images for training and 5k images for testing. Compared to the English dataset, Chinese characters have a large number of categories, and their layout and arrangement are more complex and varied.

TDA-ReCTS [53] is a multi-language validation benchmark for text detection ambiguity, which includes 1k images selected from ReCTS in the case of large character spacing or juxtaposed text lines. And the rested ReCTS images are deemed as the training set.

LSVT [45] provides a large variety of texts from street view in which 50k images are fully annotated, and 400k images are partially annotated (only provides one transcription annotation per image).

4.2 Implementation Details

4.2.1 Network

The backbone network is ResNet-50 pre-trained on ImageNet [20], and the text query number is set to 50. The recognition branch is composed of two layers of Transformer decoder with 32 character queries for the English datasets and 50 character queries for the Chinese datasets. All models are trained with a batch size of 2 on 2 A100 GPUs, using AdamW [29] optimizer with poly learning rate strategy [6]. The initial learning rate is $10^{-4}$ with the weight decay of $0.05$ .

4.2.2 Training

We divide the training process into two steps: pre-training and fine-tuning. For Chinese datasets like TDA-ReCTS and ReCTS, we pre-train our model on SynChinese 130k and fine-tune it on the target datasets. And we further train our model with mixed supervision, using the mixture of the weakly-annotated dataset LSVT and the target dataset. For English datasets like ICDAR 2015 and Total-Text, the pre-trained data is collected from public datasets, including SynthText 150k, ICDAR 2013 [19], ICDAR-MLT [34], COCO-Text [50] and ArT [9]. And the pre-trained model is then fine-tuned on the target datasets. The following data augmentations are used in our training, which include: (1) randomly resizing the shorter side of the image of [512, 640, 800, 1024], and (2) randomly rotating the image with an angle of $[0^{\circ},90^{\circ},180^{\circ},270^{\circ}]$ . And the size of the character dictionary is set to be 36 (including 26 letters, 10 digits) for English datasets and 5462 for Chinese datasets.

4.2.3 Inference

During inference stage, the shorter side of the image is resized to 1000 while keeping the aspect ratio.

4.3 Evaluation of English Text Spotting Benchmarks

The evaluation metric follows the standard polygon evaluation principle with a 0.5 IoU threshold for a fair comparison. And the data in the table are presented in percentages.

We first test our method on ICDAR 2015 to show the effectiveness on multi-oriented scene text dataset. Table 1 shows that TextFormer achieves the best performance on detection and end-to-end (E2E) items under “Strong”, “Weak”, and “General” lexicons comparing to previous methods. The best recall result from Table 1 reveals the powerful ability of our model to capture the small text instance in the scene image. And our method outperforms ABINet++ in terms of the “General” metric without using an extra language model. Some example results for ICDAR 2015 are shown in Fig. 5. Whether it is the horizontal, vertical, dense, blurred, or obscured layout, our method can successfully detect and recognize these texts successfully.

For the arbitrarily-shaped scene text, we conducted an experiment on Total-Text to test the performance of our model. The results are presented in Table 2. In the text detection task, our model can achieve a comparable result comparing to ABCNet v2. As for the end-to-end text spotting task, TextFormer surpasses the state-of-the-art method both in “None” and “Full” metrics. Some qualitative results are shown in Fig. 6. In summary, the results on ICDAR 2015 and Total-Text demonstrate the superiority of our model for multi-oriented and arbitrarily-shaped text spotting.

4.4 Evaluation of Chinese Text Spotting Benchmarks

In addition to the English text spotting dataset, we also evaluate our method on Chinese text spotting benchmarks to further verify the generalization ability of TextFormer. Compared to reading Latin-based text, Chinese text spotting remains challenging due to its large vocabulary of Chinese characters and detection ambiguity, which often occurs when multiple horizontal and vertical text lines are interspersed. Therefore, we evaluate our method on TDA-ReCTS, ReCTS, and LSVT following the standard evaluation protocol metrics [53].

As shown in Table 3, our method achieves the best performance on TDA-ReCTS and ReCTS datasets compared to previous end-to-end methods. Especially on the TDA-ReCTS dataset, which is proposed for ambiguous text spotting, TextFormer outperforms AE TextSpotter by 11.8% (63.1% vs. 51.3%) in terms of 1-NED without using an additional language model. We attribute this to the query-based multi-task modeling that makes our model focus on semantic text area rather than meaningless detection region. Furthermore, by training our model with weak annotation that contains a single text transcription per image, TextFormer* can further improve the detection and end-to-end performance which achieves an E2E 1-NED of 64.5% on TDA-ReCTS and 78.8% on ReCTS. This greatly validates the effectiveness of our proposed mixed-supervision learning scheme. Some examples for TDA-ReCTS are shown in Fig. 7. Our model can handle ambiguous texts well and learn the reading order correctly.

And we also conduct an experiment on the ICDAR 2019 LSVT competition. Our TextFormer achieves the F-measure of 54.2% and the 1-NED of 60.7%. And with the use of weak annotations provided in LSVT dataset, it can further improve end-to-end performance and achieve an E2E 1-NED of 62.2%. Note that without using extra datasets, our single-scale and single-model result is comparable to the Top3 methods on the leader board, which using extra datasets, multi-model ensemble, and multi-scale testing strategy.

Table 4: Ablation study of different tasks and different supervisions on TDA-ReCTS. The columns “Masks” and “Texts” mean training the model with the annotations of instance masks and texts, respectively. “Weak Texts” denotes only one text transcription per image is used to train the model.

Row	Tasks		Annotations			Detection			End-to-end
Row	Detection	Recognition	Masks	Texts	Weak Texts	Precision	Recall	F-measure	1-NED
1	✓	$\times$	✓	$\times$	$\times$	83.3	78.2	80.7	-
2	$\times$	✓	$\times$	✓	$\times$	78.9	61.8	69.3	44.3
3	$\times$	✓	$\times$	✓	✓	73.4	66.6	69.8	47.0
4	✓	✓	✓	✓	$\times$	84.6	82.0	83.3	63.1
5	✓	✓	✓	✓	✓	84.6	82.7	83.6	64.5

Table 5: Comparison of different feature extractors on ICDAR 2015. “S”, “W”, “G” mean text recognition with Strong (S), Weak (W), and Generic (G) lexicon, respectively.

Row	Feature Extractor	Detection			End-to-end
Row	Feature Extractor	Precision	Recall	F-measure	S	W	G
1	Masked RoI	94.3	89.0	91.6	77.1	77.0	70.0
2	$AGG_{v}$	93.9	88.4	91.0	78.5	78.9	73.8
3	$AGG_{h}$	95.4	85.2	90.0	83.0	79.5	75.7
4	$AGG$	94.3	89.2	91.7	84.5	80.9	76.0

4.5 Ablation Studies

In this section, we first conduct an ablation study on TDA-ReCTS to verify the effectiveness of multi-task modeling and our proposed mixed supervision. And then, we do ablation studies on ICDAR 2015 to evaluate the effectiveness of our feature extractor and the design of the text recognizer.

4.5.1 Effectiveness of Multi-task Modeling

To prove the superiority of multi-task modeling, we compare the performance of a single detector (row #1) with our end-to-end text spotter (row #4) on TDA-ReCTS. In the spirit of equality, the detector is separated from our end-to-end text spotter. The detection results are shown in Table 4. The end-to-end model achieves 83.3% F-measure on the text detection task, which surpasses the single detector in 2.6% (83.3% vs. 80.7%). It can be seen that the multi-task modeling improves the detection result greatly. The potential reason is that in the end-to-end framework, the detection branch and recognition branch can be mutually trained. They share the same backbone and semantic features, so the recognition loss can propagate to the detection branch, which improves the detection result.

Table 6: Effectiveness of recognition head design under different numbers of decoder layers and different recognition loss on ICDAR 2015. “S”, “W”, “G” mean text recognition with Strong (S), Weak (W), and Generic (G) lexicon, respectively. “P”, “R”, “F” indicate precision, recall and f-measure. “CE” represents cross-entropy loss.

Layer	Loss	Detection			End-to-end
Layer	Loss	P	R	F	S	W	G
1L	CE	95.3	86.3	90.6	82.2	78.6	72.3
2L	CE	94.3	89.2	91.7	84.5	80.9	76.0
2L	CTC	95.1	85.8	90.0	82.8	79.4	74.2
3L	CE	94.6	86.0	90.1	82.1	78.8	74.3

4.5.2 Effectiveness of Mixed Supervision

As discussed in Sec 3.6, our model can be trained in a mixed-supervision learning manner. In Table 4, we show the detection and end-to-end results on TDA-ReCTS. Using real data with only text annotations and synthetic data with full annotations (row #2), our model can achieve a competitive result compared to previous fully supervised methods [26, 31]. And we further train our model (row #3) using weak texts in the LSVT dataset, which only contains single transcription per image. We see that it obtains better detection and end-to-end results. The same improvement is shown in (row #5). Compared with the standard text spotter that uses full annotations to train the model (row #4), we add additional data with weak text annotations (row #5) which achieves a 1.4% boost on the E2E item. With this learning way, we can significantly reduce the labeling cost while improving the results of the model.

4.5.3 Effectiveness of AGG Module

As presented in Table 6, we show the effectiveness of our proposed AGG (row #1 vs. row #4). Higher detection and end-to-end results are obtained in the presence of AGG module. Compared with Masked RoI feature extractor [55], which extracts the feature locally, the model with AGG module outperforms it by 6% in “General” metric. This demonstrates that our recognition head benefits a lot from the proposed global feature extractor. And we also do ablation studies on $AGG_{v}$ and $AGG_{h}$ , which contain single-orientation information. The results are shown on (Row #2) and (Row#3) in Table 6, we observe that competitive results can still be achieved.

4.5.4 Influence of the Number of Decoder Layers and Recognition Loss

To verify the legitimacy of our recognition head, we do ablation studies on recognition head and recognition loss. The results are shown in Table 5. We note that using just one layer of decoder, the model can still achieve 72.3 % on “General” item. As the number of layers increases, the amount of data and training time for the convergence of recognition head are enhanced. And the model with CE loss performs better than CTC [14] loss with 1.8% improvement. For data and performance considerations, we choose two layers of decoder and CE loss as our recognition head.

4.6 Visualization Analysis

With the help of multi-task modeling, our method generates high-quality polygons and transcriptions for arbitrarily-shaped texts. It works well for vertical and curved texts, as shown in Fig. 6 and Fig. 7. Even for some extreme cases, like highly-occluded text, our method can still rightly recognize them by the proposed AGG module to make full use of the global context rather than local features. These visualization results confirm the superiority of multi-task modeling and mixed-supervision learning manner.

Fig. 8 shows the final segmentation results and attention masks which are generated by Eq. (1). The visual results indicate that our segmentation results and attention masks can focus on the right place. Note that in the third and fourth columns, the word “SALE” is obscured by the other billboard. Our method can still accurately recognize the obscured word. Especially on the fourth column, the character “E” is lost in the segmentation results while being activated in the attention mask, which means our model could predict correct transcription even with the inaccurate segmentation result. We attribute this to our multi-mask modeling design, which alleviates the dependency of the text detection and recognition parts to some extent.

5 Conclusion and Future work

In this paper, we propose a fully end-to-end arbitrary-shaped text spotter based on transformer. By modeling classification, segmentation and recognition tasks in a parallel query-based way, our model can be jointly trained and optimized. And we introduce a light-weight recognition head with AGG module to extract features globally. Without bells and whistles, our method achieves competitive results on both regular text and curved text benchmarks.

Since our model is built upon transformer architecture, its final performance is heavily depend on the scale of training data. In the future work, we may solve this problem by self-supervised learning.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[2] Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence, 11(6):567–585, 1989.
[3] Jakob Božič, Domen Tabernik, and Danijel Skočaj. Mixed supervision for surface-defect detection: From weakly to fully supervised learning. Computers in Industry, 129:103459, 2021.
[4] Michal Busta, Lukas Neumann, and Jiri Matas. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of the IEEE international conference on computer vision, pages 2204–2212, 2017.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
[7] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
[8] Chee Kheng Ch’ng and Chee Seng Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
[9] Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. Icdar2019 robust reading challenge on arbitrary-shaped text-rrc-art. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1571–1576. IEEE, 2019.
[10] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (Csur), 40(2):1–60, 2008.
[11] Shancheng Fang, Zhendong Mao, Hongtao Xie, Yuxin Wang, Chenggang Yan, and Yongdong Zhang. Abinet++: Autonomous, bidirectional and iterative language modeling for scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[12] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. Textdragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9076–9085, 2019.
[13] Lluís Gómez and Dimosthenis Karatzas. Textproposals: a text-specific selective search algorithm for word spotting in the wild. Pattern recognition, 70:60–74, 2017.
[14] Alex Graves and Alex Graves. Connectionist temporal classification. Supervised sequence labelling with recurrent neural networks, pages 61–93, 2012.
[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
[16] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5020–5029, 2018.
[17] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. International journal of computer vision, 116:1–20, 2016.
[18] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[19] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. Icdar 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1484–1493. IEEE, 2013.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[21] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 5238–5246, 2017.
[22] Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, and Errui Ding. Structext: Structured text understanding with multi-modal transformers. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1912–1920, 2021.
[23] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. Mask textspotter v3: Segmentation proposal network for robust scene text spotting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 706–722, 2020.
[24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[26] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5676–5685, 2018.
[27] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. Abcnet: Real-time scene text spotting with adaptive bezier-curve network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9809–9818, 2020.
[28] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting. arXiv preprint arXiv:2105.03620, 2021.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[30] Pu Lu, Hao Wang, Shenggao Zhu, Jing Wang, Xiang Bai, and Wenyu Liu. Boundary textspotter: Toward arbitrary-shaped scene text spotting. IEEE Transactions on Image Processing, 31:6200–6212, 2022.
[31] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–83, 2018.
[32] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016.
[33] Pawel Mlynarski, Hervé Delingette, Antonio Criminisi, and Nicholas Ayache. Deep learning with mixed supervision for brain tumor segmentation. Journal of Medical Imaging, 6(3):034002–034002, 2019.
[34] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1454–1459. IEEE, 2017.
[35] Lukáš Neumann and Jiří Matas. Real-time lexicon-free scene text localization and recognition. IEEE transactions on pattern analysis and machine intelligence, 38(9):1872–1885, 2015.
[36] Dezhi Peng, Xinyu Wang, Yuliang Liu, Jiaxin Zhang, Mingxin Huang, Songxuan Lai, Jing Li, Shenggao Zhu, Dahua Lin, Chunhua Shen, et al. Spts: single-point text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4272–4281, 2022.
[37] Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. Mango: A mask attention guided one-stage scene text spotter. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2467–2476, 2021.
[38] Liang Qiao, Sanli Tang, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. Text perceptron: Towards end-to-end arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11899–11907, 2020.
[39] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4704–4714, 2019.
[40] Zobeir Raisi, Mohamed A Naiel, Georges Younes, Steven Wardell, and John S Zelek. Transformer-based text detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3162–3171, 2021.
[41] Harita Reddy, Namratha Raj, Manali Gala, and Annappa Basava. Text-mining-based fake news detection using ensemble methods. International Journal of Automation and Computing, 17(2):210–221, 2020.
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
[43] Paul Ricoeur. The model of the text: Meaningful action considered as a text. Social research, pages 529–562, 1971.
[44] Xuejian Rong, Bing Li, J Pablo Munoz, Jizhong Xiao, Aries Arditi, and Yingli Tian. Guided text spotting for assistive blind navigation in unfamiliar indoor environments. In Advances in Visual Computing: 12th International Symposium, ISVC 2016, Las Vegas, NV, USA, December 12-14, 2016, Proceedings, Part II 12, pages 11–22. Springer, 2016.
[45] Yipeng Sun, Zihan Ni, Chee-Kheng Chng, Yuliang Liu, Canjie Luo, Chun Chet Ng, Junyu Han, Errui Ding, Jingtuo Liu, Dimosthenis Karatzas, et al. Icdar 2019 competition on large-scale street view text with partial labeling-rrc-lsvt. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1557–1562. IEEE, 2019.
[46] Yipeng Sun, Chengquan Zhang, Zuming Huang, Jiaming Liu, Junyu Han, and Errui Ding. Textnet: Irregular text reading from images with an end-to-end trainable network. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 83–99. Springer, 2019.
[47] Jingqun Tang, Su Qiao, Benlei Cui, Yuhang Ma, Sheng Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022.
[48] Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022.
[49] Shangxuan Tian, Shijian Lu, and Chongshou Li. Wetext: Scene text detection under weak supervision. In Proceedings of the IEEE International Conference on Computer Vision, pages 1492–1500, 2017.
[50] Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge Belongie. Coco-text: Dataset and benchmark for text detection and recognition in natural images. In arXiv preprint arXiv:1601.07140, 2016.
[51] Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, Yongchao Xu, Mengchao He, Yongpan Wang, and Wenyu Liu. All you need is boundary: Toward arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12160–12167, 2020.
[52] Pengfei Wang, Chengquan Zhang, Fei Qi, Shanshan Liu, Xiaoqiang Zhang, Pengyuan Lyu, Junyu Han, Jingtuo Liu, Errui Ding, and Guangming Shi. Pgnet: Real-time arbitrarily-shaped text spotting with point gathering network. arXiv preprint arXiv:2104.05458, 2021.
[53] Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, ZhiBo Yang, Tong Lu, Chunhua Shen, and Ping Luo. Ae textspotter: Learning visual and linguistic representation for ambiguous text spotting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 457–473. Springer, 2020.
[54] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9336–9345, 2019.
[55] Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Yang Zhibo, Tong Lu, and Chunhua Shen. Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[56] Liang Wu, Chengquan Zhang, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding, and Xiang Bai. Editing text in the wild. In Proceedings of the 27th ACM international conference on multimedia, pages 1500–1508, 2019.
[57] Linjie Xing, Zhi Tian, Weilin Huang, and Matthew R Scott. Convolutional character networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9126–9136, 2019.
[58] Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, and Jingdong Wang. Structextv2: Masked visual-textual prediction for document image pre-training. arXiv preprint arXiv:2303.00289, 2023.
[59] Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, and Yunde Jia. Fast-structext: An efficient hourglass transformer with modality-guided dynamic token merge for document understanding. arXiv preprint arXiv:2305.11392, 2023.
[60] Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, et al. Icdar 2019 robust reading challenge on reading chinese text on signboard. In 2019 international conference on document analysis and recognition (ICDAR), pages 1577–1581. IEEE, 2019.
[61] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5551–5560, 2017.
[62] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.