
School of Computer and Communication Engineering, University of Science and Technology Beijing
[email protected], [email protected], [email protected]

SAN: Structure-Aware Network for Complex and Long-tailed Chinese Text Recognition

Junyi Zhang, Chang Liu, Chun Yang (Corresponding Author)
Abstract

In text recognition, complex glyphs and tail classes have always been factors affecting model performance. This is especially true for Chinese text recognition, where a lack of shape-awareness can lead to confusion among visually close complex characters. Such characters are often tail classes that appear less frequently in the training set, making it harder for the model to capture their shape information. Hence, in this work, we propose a structure-aware network that utilizes hierarchical composition information to improve the recognition performance on complex characters. Implementation-wise, we first propose an auxiliary radical branch and integrate it into the base recognition network as a regularization term, which distills hierarchical composition information into the feature extractor. A Tree-Similarity-based weighting mechanism is then proposed to further utilize the depth information in the hierarchical representation. Experiments demonstrate that the proposed approach significantly improves the performance on complex and tail characters, yielding a better overall performance. Code is available at https://github.com/Levi-ZJY/SAN

Keywords:
Structure awareness · Text recognition · Radical · Tree Similarity

1 Introduction

Chinese text recognition plays an important role in the field of text recognition due to its huge audience and market. Most current text recognition methods, including Chinese text recognition methods, are character-based, where characters are the basic elements of the prediction. Specifically, most methods fit into the framework formulated by Baek et al. [1], which includes an optional rectifier (Trans.), a feature extractor (Feat.), a sequence modeler (Seq.), and a character classifier (Pred.). Many typical Chinese text recognition methods [9, 25, 29] also fall into this category, where the feature extractor generally takes the form of a convolutional neural network and the classifier is mostly implemented as a linear classifier that decodes input features into predicted character probabilities. However, this naive classification strategy has limited performance on Chinese samples due to the large character set, the severely unbalanced character frequencies, and the complexity of Chinese glyphs.

To address the frequency skew, compositional learning strategies are widely used in low-shot Chinese character recognition tasks [31, 6, 12]. Regarding the compositional information exploited, the majority of implementations [31, 20, 2, 21, 24] utilize the radical representation, where the components and structural information of each character are hierarchically modeled as a tree. Specifically, the basic components serve as leaf nodes and the structural information (the spatial placement of each component) serves as non-leaf nodes. Some methods also utilize stroke [6] or Wubi [12] representations. Besides Chinese, characters in many other languages can be similarly decomposed into basic components [6, 4, 16, 3]. These methods somewhat improve text recognition performance under low-shot scenarios. However, compositional methods are rarely seen in regular recognition methods due to their complexity and less satisfactory performance. This limitation is addressed by PRAB [7], which proposes to use radical information as a plug-in regularization term. The method decodes the character feature at each timestamp into its corresponding radical sequence and achieves significant overall performance improvements on several SOTA baselines. However, the method still has two major limitations. First, PRAB [7] only applies to text recognition methods with explicitly aligned character features [18, 13] and does not apply to implicitly aligned CTC-based methods like CRNN [17]. Furthermore, PRAB and most of the aforementioned radical methods treat large parts and small parts alike, ignoring the depth information of the hierarchical representation.

To alleviate these limitations, we propose a Structure-Aware Network (SAN) which distills hierarchical composition information into the feature extractor with the proposed alignment-agnostic and depth-aware Auxiliary Radical Branch (ARB). ARB serves as a regularization term that directly refines the feature maps extracted by the Feat. stage to preserve the hierarchical composition information, without explicitly aligning the feature map to each character. The module allows the model to focus more on local features and learn more structured visual features, which significantly improves complex character recognition accuracy. As basic components are shared among head and tail classes alike, it also improves tail-class performance by explicitly exploiting their connections with head classes. Furthermore, we propose a novel Tree Similarity (TreeSim) semimetric that serves as a more fine-grained measure of the visual similarity between two characters. TreeSim further allows us to exploit the depth information of the hierarchical representation, which is implemented by weighting the tree nodes accordingly in ARB.

Experiments demonstrate the effectiveness of ARB in optimizing the recognition performance of complex characters and long-tailed characters, and it also improves the overall recognition accuracy of Chinese text. The contributions of this work are as follows:

  • We propose SAN for complex character recognition, utilizing the hierarchical component information of characters.

  • We introduce ARB, based on tree modeling of the label, which enhances the structure-awareness of visual features. ARB shows promising improvements on complex and long-tailed character recognition and also improves the overall recognition accuracy of Chinese text.

  • We propose a novel TreeSim method to measure the similarity of two characters, and a TreeSim-based weighting mechanism for ARB to further utilize the depth information in the hierarchical representation.

2 Related Work

2.1 Character-based Approaches

In Chinese text recognition, early works are often character-based. Some design improved or integrated methods based on CNN models [9, 25, 29]. For example, MCDNN [9] integrates multiple models including CNNs and shows advantageous performance in handwritten character recognition. ART-CNN [25] alternately trains a relaxation CNN to regularize the network during training and achieves state-of-the-art accuracy. Later, DNN-HMM [10] models the text line sequentially and adopts a DNN to model the posterior probability of all HMM states, significantly outperforming the best over-segmentation-based approach [19]. The SOTA text recognition model ABINet [11] recommends blocking the gradient flow between the vision and language models and introduces an iterative correction approach for the language model; these strategies promote explicit language modeling and effectively mitigate the influence of noisy input. However, these methods offer no effective countermeasures against the hard problems of Chinese text recognition, such as the many complex characters and insufficient training samples, so their performance improvements are greatly limited.

2.2 Chinese-enhanced Character-based Approaches

Several methods attempt to design targeted optimization strategies according to the characteristics of Chinese text [26, 8, 23, 28, 22, 27]. Wu et al. [26] use an MLCA recognition framework and a new writing-style-aware image synthesis method to overcome the problems of large character sets and insufficient training samples. In [8], the authors apply Maximum Entropy Regularization to regularize the training process, targeting the large number of fine-grained Chinese characters and the great class imbalance. Wang et al. [23] utilize the similarity of Chinese characters to reduce the total number of HMM states and to model the tied states more accurately. These methods pay attention to the particularities of Chinese characters and provide targeted optimizations. However, they are still character-based, which makes it difficult to further exploit the deeper features of Chinese characters.

2.3 Radical-based Approaches

In recent years, radical-based approaches have shown outstanding advantages in Chinese recognition [31, 20, 2, 21, 24]. RAN [31] employs an attention mechanism to extract radicals from Chinese characters and to detect the spatial structures among them, which reduces the vocabulary size and enables the recognition of unseen characters. FewShotRAN [20] maps each radical to a latent space and uses CAD to analyze the radical representation. HDE [2] designs an embedding vector for each character and learns both the radicals and structures of characters via a semantic vector, achieving superior performance in both traditional and zero-shot HCCR tasks. Inspired by these works, our method emphasizes the role of the structural information of radicals in visual perception and proposes to utilize the common local and structural traits shared between characters to optimize recognition performance on complex and long-tailed characters.

3 Our Method

Figure 1: The Structure-Aware Network (SAN). The orange box is ARB-TreeSim and the blue box is the base network. The gradient flows of ARB-TreeSim and the VM decoder jointly influence the feature extractor.

In this work, we propose the Structure-Aware Network (SAN, Fig. 1), which comprises a base method (in blue) and the proposed Auxiliary Radical Branch (ARB). Implementation-wise, we adopt the SOTA ABINet [11] as an example base method. ABINet features separate vision and language models with no gradient flow between them; their losses, denoted L_{vm} and L_{lm} respectively, are optimized independently of each other. Specifically, ARB is applied to the sample feature map extracted by the vision model, yielding a novel two-branch learning network. During training, the feature map is processed by both branches: one passes it into the VM decoder, which decodes it into characters, and the other passes it into ARB, which decodes it into radicals. During evaluation, the model reduces to the corresponding base model, incurring no extra cost.
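To make the two-branch design concrete, the following PyTorch sketch shows how a shared feature extractor could feed both the character decoder and ARB during training and drop ARB at evaluation. All module names (feat_extractor, vm_decoder, lm, arb_decoder) are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class SAN(nn.Module):
    """Minimal sketch of the two-branch design; submodules are assumed."""

    def __init__(self, feat_extractor, vm_decoder, lm, arb_decoder):
        super().__init__()
        self.feat = feat_extractor    # shared Feat. stage (e.g. a CNN backbone)
        self.vm_decoder = vm_decoder  # character-level decoder of the vision model
        self.lm = lm                  # language model; ABINet blocks its input gradient
        self.arb = arb_decoder        # Auxiliary Radical Branch, used only in training

    def forward(self, images, training=True):
        fmap = self.feat(images)                    # one shared feature map
        char_logits = self.vm_decoder(fmap)         # character predictions (L_vm)
        lm_logits = self.lm(char_logits.detach())   # gradient flow blocked, as in ABINet
        if training:
            radical_logits = self.arb(fmap)         # radical-sequence predictions (L_arb)
            return char_logits, lm_logits, radical_logits
        return char_logits, lm_logits               # ARB is dropped at evaluation
```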

The remainder of this section first introduces the hierarchical representation of characters and the Tree Similarity (TreeSim) metric, which measures the similarity of two radical trees. We then introduce the proposed Auxiliary Radical Branch, a regularization term that improves the model's shape-awareness by distilling the hierarchical representation, and finally the TreeSim-enhanced weighting, which adds depth information to the ARB branch.

3.1 Label Modeling

As characters can be decomposed into basic components, we model each character as a tree of components organized hierarchically, and model the label as a forest composed of character trees, which gives the label a structural representation.

As shown in Fig. 2(a), one popular modeling method in Chinese is to decompose characters into radicals and structures [7]. Radicals are the basic components of characters, and structures describe the spatial composition of the radicals in each character. A structure is usually associated with two or more radicals and is always the root node of a subtree. In this way, every Chinese character can be modeled as a tree in which the leaf nodes are radicals and the non-leaf nodes are structures. For simplicity, we call such trees “radical trees” in this paper.
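A radical tree needs only a very small data structure. The sketch below is a minimal Python illustration, assuming the usual Ideographic Description Character convention in which a structure symbol such as ⿰ (left-right) heads a subtree of radicals; the example character 清 is our own, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RadicalNode:
    """One node of a radical tree: a structure at an internal node, a radical at a leaf."""
    symbol: str                                        # structure (e.g. '⿰') or radical
    children: List["RadicalNode"] = field(default_factory=list)

# Illustrative example (ours): 清 decomposes into a left-right structure ⿰
# whose children are the radicals 氵 and 青.
qing = RadicalNode("⿰", [RadicalNode("氵"), RadicalNode("青")])
```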

Figure 2: (a) Radical tree; (b) TreeSim weight. In (a), the red nodes are structures and the blue nodes are radicals. In (b), the yellow circle indicates that the total weight of the subtree is 1/3.

In this work, we propose a novel Tree Similarity (TreeSim) metric to measure the visual similarity between two characters represented as radical trees. In a radical tree, the upper components carry large, global information, while the lower components carry small, local information. When comparing the similarity of two characters, humans tend to pay more attention to the upper components and less to the lower ones.

Correspondingly, as shown in Fig. 2(b), we first propose a weighting method for TreeSim with three characteristics: first, upper components in the radical tree carry greater weight; second, every subtree is regarded as an independent unit, and the root node and its subtrees have equal weight; third, the total weight of the tree is 1. The calculation procedure is depicted in Algorithm 1, where Node is a pointer to the root node and w_sub is the weight of the subtree. The weight of each node can be calculated recursively.

Algorithm 1 TreeSim Weight Algorithm
Input: Node: pointer to the root node; w_sub: weight of the subtree, equal to 1 in the initial call
function GetTreeSimWeight(Node, w_sub)
    n ← number of children of Node
    if n == 0 then
        weight of Node ← w_sub
        return
    else
        weight of Node ← w_sub / (n + 1)
        for c in Node.Children do
            GetTreeSimWeight(c, w_sub / (n + 1))
        end for
    end if
    return
end function
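Translated into Python over the RadicalNode structure sketched above, Algorithm 1 might look as follows; the dictionary keyed by node identity is our choice for storing the per-node weights.

```python
def get_treesim_weight(node, w_sub=1.0, weights=None):
    """Algorithm 1: a leaf keeps the whole subtree weight w_sub; an internal
    node with n children splits w_sub equally among itself and its n
    subtrees, so each party receives w_sub / (n + 1)."""
    if weights is None:
        weights = {}                       # node id -> TreeSim weight
    n = len(node.children)
    if n == 0:
        weights[id(node)] = w_sub
    else:
        weights[id(node)] = w_sub / (n + 1)
        for child in node.children:
            get_treesim_weight(child, w_sub / (n + 1), weights)
    return weights
```

For a root with two children, the root and each of its two subtrees receive 1/3 of the total weight, matching the example in Fig. 2(b).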

Based on this weighting method, a sample TreeSim calculation is shown in Fig. 3. The calculation proceeds as follows: first, select either of the two radical trees and traverse its nodes in preorder; second, judge whether each node matches, where a node matches if 1) every ancestor of the node matches, and 2) the corresponding location in the other tree also has a node and the two nodes are the same; finally, TreeSim equals the sum of the weights of all matching nodes.

The proposed TreeSim is hierarchical, bidirectional, and depth-independent. Hierarchical means TreeSim pays different attention to different levels of the tree. Bidirectional means the calculation result is the same whichever of the two trees is selected. Depth-independent means the weight of nodes at a certain level is independent of the depth of the tree.
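Under these rules, the TreeSim computation can be sketched by walking both trees in lockstep and pruning at the first mismatch, since a mismatched node disqualifies all of its descendants. This is our reading of the matching rules, not the authors' code.

```python
def tree_sim(a, b):
    """Sketch of TreeSim between two radical trees a and b."""
    weights = get_treesim_weight(a)        # weights of the selected tree

    def walk(na, nb):
        # A node matches only if the other tree has a node at the same
        # location with the same symbol; all ancestors have matched by
        # construction, because we only recurse through matches.
        if nb is None or na.symbol != nb.symbol:
            return 0.0
        sim = weights[id(na)]
        for i, child in enumerate(na.children):
            other = nb.children[i] if i < len(nb.children) else None
            sim += walk(child, other)
        return sim

    return walk(a, b)
```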

Based on this label modeling method, we propose two loss design strategies. Considering that a rigorous loss calculation over two forests may be complex and nonlinear, we introduce two linear approximate calculation methods in Sec. 3.2 to supervise the loss calculation.

Figure 3: TreeSim calculation. Nodes circled in red are matching nodes and nodes circled in black are mismatched nodes. The calculation result is the same for either of the two radical trees.

3.2 Auxiliary Radical Branch

Inspired by the radical-based methods [20, 7], we first propose an Auxiliary Radical Branch (ARB) module to enhance the structure-awareness of visual features by using the hierarchical representation as a regularization term. Unlike PRAB [7], the proposed ARB module directly decodes the feature map extracted by the Feat. stage, thus does not need individual character features and can theoretically be applied to most methods with a Feat. stage [1]. As shown in Fig. 1, ARB has the following characteristics: 1) the label representation in ARB is hierarchical, containing component and structural information; 2) ARB is an independent visual perception optimization branch, which takes the feature map as input, decodes it according to the structured representation, and finally refines the visual feature extraction.

The role of ARB is reflected in two aspects. First, ARB can use the hierarchical label representation to supplement and refine feature extraction, improving the structure-awareness of the model. Second, by taking advantage of the common component combinations shared between Chinese characters, ARB can exploit the component-level connections between tail classes and head classes, which optimizes long-tailed character recognition.

In ARB, we propose two linear loss design strategies, which reflect the hierarchical label representation while simplifying the design and calculation.

3.2.1 Sequence Modeling

Considering that the proposed label modeling method is top-down and root-first, we design the loss to match it. We model each tree in the label as the sequence obtained by traversing the radical tree in preorder, i.e., the linear radical structure sequence [7]. This sequence follows the root-first design and retains the hierarchical structure information of the radical tree to a certain extent. By concatenating the radical structure sequences of all characters in the label, we obtain the sequence-modeled label.
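As an illustration, the preorder serialization and label concatenation could be implemented as below, reusing the RadicalNode sketch from Sec. 3.1.

```python
def radical_structure_sequence(node):
    """Preorder (root-first) traversal of a radical tree, yielding the
    linear radical structure sequence of one character."""
    seq = [node.symbol]
    for child in node.children:
        seq.extend(radical_structure_sequence(child))
    return seq

def arb_label(character_trees):
    """Concatenate per-character sequences into the sample-level ARB label."""
    return [s for tree in character_trees for s in radical_structure_sequence(tree)]

# e.g. radical_structure_sequence(qing) -> ['⿰', '氵', '青']
```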

The ARB prediction is supervised with the ARB loss L_{arb}, a weighted cross-entropy loss over each element:

L_{arb} = -\sum_{i}^{l_{rad}} w_{rad_{(i)}} \log P(r^{*}_{(i)}),    (1)

where r^{*}_{(i)} is the ground truth of the i-th radical and l_{rad} is the total radical length of all characters in the sample.
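A possible PyTorch realization of Eq. 1 is sketched below; the tensor shapes are assumptions, and the per-element weights w_rad come from the weighting schemes of Secs. 3.2.2 and 3.2.3.

```python
import torch
import torch.nn.functional as F

def arb_loss(radical_logits, radical_targets, radical_weights):
    """Weighted cross-entropy over the radical structure sequence (Eq. 1).
    Assumed shapes: radical_logits (l_rad, num_radical_classes) from the
    ARB decoder, radical_targets (l_rad,) ground-truth indices,
    radical_weights (l_rad,) the per-element weights w_rad."""
    log_probs = F.log_softmax(radical_logits, dim=-1)
    # Negative log-likelihood of each ground-truth radical r*_(i).
    nll = -log_probs.gather(1, radical_targets.unsqueeze(1)).squeeze(1)
    return (radical_weights * nll).sum()
```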

The radical structure sequence contains the component and structural information of the label, so this modeling method introduces the hierarchical information and effectively improves the vision model's performance.

3.2.2 Naive Weighting

We first use a naive weighting that assigns equal weights to the radicals in each character, i.e., \mathbf{w}_{rad} = \mathbf{w}_{naive} = \mathbf{1}. Although the naive method demonstrates some performance improvement, it treats all components as equally important, which may lead to over-focusing on minor details.

3.2.3 TreeSim Enhanced Weighting

In sequence modeling, the structural information is implicit; we want to make it more explicit. We therefore propose the TreeSim-enhanced weighting method to explicitly strengthen the structural information of the label representation.

We add the TreeSim weight (Fig. 2(b)) to the naive weight as a regularization term and obtain the final radical weight:

\mathbf{w}_{rad} = \mathbf{w}_{naive} + \lambda_{TreeSim}\,\mathbf{w}_{TreeSim},    (2)

where \lambda_{TreeSim} is set to 1. This method explicitly enhances the structural information in the loss, and shows better performance on the vision model than the naive method.
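In code, Eq. 2 amounts to a one-line combination of the two weight vectors; the sketch below assumes w_TreeSim has been collected in preorder so that it aligns with the radical structure sequence.

```python
import torch

def combined_weights(w_treesim, lambda_treesim=1.0):
    """Eq. 2: w_rad = w_naive + lambda_TreeSim * w_TreeSim, with w_naive = 1."""
    w_naive = torch.ones_like(w_treesim)
    return w_naive + lambda_treesim * w_treesim
```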

3.3 Optimization

The training gradients from the two branches are superimposed and jointly update the weights of the feature extractor, which allows the radical-level prediction branch to supplement and refine the visual feature extraction of the character prediction branch during training. We define the overall loss as

L_{overall} = L_{base} + L_{arb},    (3)

where L_{base} is the loss of the base model, which in this work is the ABINet loss, i.e.,

L_{base} = L_{vm} + L_{lm}.    (4)

Both the character prediction branch and the radical structure prediction branch affect feature extraction. For fairness, we give them equal influence on visual feature extraction, ensuring that the contributions of both are fully reflected.
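A training step under Eqs. 3 and 4 could then look as follows; vm_loss and lm_loss stand in for ABINet's own loss terms and are assumptions on our part.

```python
# One training step (illustrative sketch). Both branches back-propagate
# into the shared feature extractor with equal weight; at evaluation the
# ARB term simply disappears.
optimizer.zero_grad()
char_logits, lm_logits, radical_logits = model(images, training=True)
loss = vm_loss(char_logits, char_targets) \
     + lm_loss(lm_logits, char_targets) \
     + arb_loss(radical_logits, radical_targets, w_rad)   # L_base + L_arb
loss.backward()
optimizer.step()
```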

4 Experiment

4.1 Implementation Details

The datasets used in our experiments are the Web dataset and the Scene dataset [7]. The Web dataset contains 20,000 Chinese and English web text images from 17 different categories on the Taobao website, and the Scene dataset contains 636,455 text images from several competitions, papers, and projects. The number of radical classes is 960. We set the value of R (Fig. 1) to 33 for the Web dataset and 39 for the Scene dataset, covering a substantial 95% of the training samples of both datasets.

We implement our method in PyTorch and conduct experiments on three NVIDIA RTX 3060 GPUs. Each input image is resized to 32×128 with data augmentation. The batch size is set to 96 and the ADAM optimizer is adopted with an initial learning rate of 1e-4, which is decayed to 1e-5 after 12 epochs.
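Using standard PyTorch utilities, this schedule could be expressed as below (our sketch, not necessarily the authors' exact code): the decay from 1e-4 to 1e-5 after 12 epochs corresponds to a single MultiStepLR milestone with gamma 0.1.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 1e-4
# One milestone at epoch 12 with gamma=0.1 decays the rate to 1e-5.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12], gamma=0.1)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # hypothetical per-epoch training loop
    scheduler.step()
```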

4.2 Ablative Study

Table 1: Ablation study of ARB. The dataset used is the Web dataset [7].

Model                  Naive radical loss   TreeSim weighting loss   Accuracy   1-NED
VM-Base                --                   --                       59.1       0.768
VM-Naive               ✓                    --                       60.7       0.779
VM-TreeSim             ✓                    ✓                        61.3       0.786
ABINet-Base            --                   --                       65.6       0.803
ABINet-Naive           ✓                    --                       66.8       0.812
ABINet-TreeSim (SAN)   ✓                    ✓                        67.3       0.817

We discuss the performance of the proposed approaches (ARB and TreeSim weighting) with two base-model configurations: the Vision Model (VM) alone and the full ABINet. Experimental results are recorded in Tab. 1.

First, the proposed ARB proves useful, significantly improving the accuracy of both base methods: VM-Naive outperforms VM-Base by 1.6% accuracy and 0.011 1-NED, and ABINet-Naive outperforms ABINet-Base by 1.2% accuracy and 0.009 1-NED. Second, the TreeSim-enhanced weighting also achieves the expected improvements: VM-TreeSim improves accuracy by 0.6% and 1-NED by 0.007 over VM-Naive, and SAN improves accuracy by 0.5% and 1-NED by 0.005 over ABINet-Naive.

The above observations suggest that the hierarchical component information is useful to the feature extractor and can significantly improve the performance of the vision model. The proposed approaches also yield significant improvements over the full ABINet, indicating that hierarchical component information still plays a significant part even when the language model can, to some extent, alleviate the confusion caused by insufficient shape-awareness.

4.3 Comparative Experiments

Table 2: Performance on Chinese text recognition benchmarks [7]. † indicates results reported by [7]; * indicates results from our experiments.

Method               Web Accuracy   Web 1-NED   Scene Accuracy   Scene 1-NED
CRNN† [17]           54.5           0.736       53.4             0.734
ASTER† [18]          52.3           0.689       54.5             0.695
MORAN† [14]          49.9           0.682       51.8             0.686
SAR† [13]            54.3           0.725       62.5             0.785
SRN† [30]            52.3           0.706       60.1             0.778
SEED† [15]           46.3           0.637       49.6             0.661
TransOCR† [5]        62.7           0.782       67.8             0.817
TransOCR-PRAB† [7]   63.8           0.794       71.0             0.843
ABINet* [11]         65.6           0.803       71.8             0.853
SAN (Ours)           67.3           0.817       73.6             0.863

Compared with Chinese text recognition benchmarks and recent SOTA works trained on the Web and Scene datasets, SAN shows impressive performance (Tab. 2). Our SAN outperforms ABINet by 1.7% and 1.8% accuracy on the Web and Scene datasets respectively, and achieves the best 1-NED on both datasets. Fig. 4 shows examples of complex and long-tailed characters that the base model (ABINet) fails to predict while the full model succeeds.

Figure 4: Successful recognition examples using ARB. (a) and (b) are complex character examples; (c) and (d) are long-tailed character examples. The text strings under each image are the ABINet prediction, the SAN prediction, and the ground truth.

4.4 Property Analysis

To validate the effect of ARB on complex character recognition and long-tailed character recognition, we divide characters into groups according to their complexity and frequency. For each group, to better understand the differences in recognition performance, the character predictions of SAN and ABINet are studied in more detail. Specifically, we calculate the accuracy of the character predictions and the average TreeSim between the predictions and the ground truth. TreeSim serves as a more fine-grained metric than accuracy, as it indicates the awareness of structural information.

4.4.1 Experiments on Characters of Different Complexity

Figure 5: The accuracy (left), average TreeSim (middle), and increase ratio of TreeSim (right) of character predictions by SAN and ABINet, grouped by RSSL, on (a) the Web dataset and (b) the Scene dataset. RSSL is the length of the radical structure sequence of a character.

We observe that the complexity of a character's structure is proportional to its radical structure sequence length (RSSL), so we use RSSL to represent the complexity of each character. We denote characters with an RSSL of 5 or 6 as sub-complex, characters with longer RSSL as complex, and characters with shorter RSSL as simple. Accordingly, we divide the character classes into three parts: RSSL ≤ 4 (on average 34% in Web and 30% in Scene), 5 ≤ RSSL ≤ 6 (38% in Web and 37% in Scene), and RSSL ≥ 7 (28% in Web and 33% in Scene), called simple, sub-complex, and complex characters respectively (see the sketch below).
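For reference, this grouping can be expressed as a small helper over the radical structure sequence; the decompose function mapping a character to its radical tree is a hypothetical dependency.

```python
def rssl_bucket(char, decompose):
    """Assign a character to a complexity group by its RSSL."""
    rssl = len(radical_structure_sequence(decompose(char)))
    if rssl <= 4:
        return "simple"
    if rssl <= 6:
        return "sub-complex"
    return "complex"
```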

We calculate the accuracy and the average TreeSim between character predictions and ground truth within each of these three parts (Fig. 5). In terms of accuracy, in both the Web dataset (Fig. 5(a)) and the Scene dataset (Fig. 5(b)), there is a consistently higher improvement for sub-complex (5 ≤ RSSL ≤ 6) and complex (RSSL ≥ 7) characters than for simple characters (RSSL ≤ 4). Regarding TreeSim, the average TreeSim of all three parts consistently increases. In the Web dataset, the rising trend becomes more pronounced as RSSL increases, with the growth on complex characters being particularly noticeable. In the Scene dataset, the increase on sub-complex characters is especially significant, and the growth rates of both sub-complex and complex characters surpass that of simple characters.

These experiments show that ARB prominently improves the similarity between the recognition results and the ground truths of complex characters, indicating that ARB has a more distinct perception of complex characters: the introduced hierarchical structure information improves the structure-awareness of the vision model and makes it easier for the model to distinguish the different components of complex characters.

4.4.2 Experiments on Long-tailed Characters

Figure 6: The accuracy (left), average TreeSim (middle), and increase ratio of TreeSim (right) of character predictions by SAN and ABINet, grouped by OccN, on (a) the Web dataset and (b) the Scene dataset. OccN is the occurrence number of a character in the training dataset.

To validate the effectiveness of our model on long-tailed characters, we sort character predictions in descending order of their occurrence number (OccN) in the training dataset and divide them into four parts: OccN ≥ 100 (on average 22% in Web and 39% in Scene), 100 > OccN ≥ 50 (12% in Web and 10% in Scene), 50 > OccN ≥ 20 (19% in Web and 15% in Scene), and 20 > OccN ≥ 0 (47% in Web and 36% in Scene).

We calculate the accuracy and the average TreeSim between the character predictions and their corresponding ground truth within each of these parts (Fig. 6). In terms of accuracy, for both the Web dataset (Fig. 6(a)) and the Scene dataset (Fig. 6(b)), the improvement for infrequent characters (OccN < 50) is consistently higher than that for frequent characters (OccN ≥ 50). Regarding TreeSim, the rising trend becomes more pronounced as the occurrence number decreases; the average TreeSim of the tail classes (20 > OccN ≥ 0) exhibits the most noticeable increase on both datasets.

These results demonstrate that ARB makes the recognition results of long-tailed characters more similar to their ground truths, indicating that ARB learns more features of long-tailed characters: because common component combinations are shared between characters, ARB can transfer traits learned on head classes to optimize the recognition of tail classes.

5 Conclusion

In this paper, we propose a Structure-Aware Network (SAN) to optimize the recognition performance on complex and long-tailed characters, using the proposed Auxiliary Radical Branch (ARB), which utilizes the hierarchical component information of characters. Besides, we propose Tree Similarity (TreeSim) to measure the similarity of two characters and a TreeSim-based weight to enhance the structural information of the label representation. Experimental results demonstrate the superiority of ARB on complex and long-tailed character recognition and show that our method outperforms standard benchmarks and recent SOTA works.

6 Acknowledgement

The research is supported by National Key Research and Development Program of China (2020AAA0109700), National Science Fund for Distinguished Young Scholars (62125601), National Natural Science Foundation of China (62076024, 62006018), Interdisciplinary Research Project for Young Teachers of USTB (Fundamental Research Funds for the Central Universities)(FRF-IDRY-21-018).

References

  • [1] Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., Oh, S.J., Lee, H.: What is wrong with scene text recognition model comparisons? dataset and model analysis. In: ICCV. pp. 4714–4722 (2019)
  • [2] Cao, Z., Lu, J., Cui, S., Zhang, C.: Zero-shot handwritten chinese character recognition with hierarchical decomposition embedding. Pattern Recognit. 107, 107488 (2020)
  • [3] Chanda, S., Baas, J., Haitink, D., Hamel, S., Stutzmann, D., Schomaker, L.: Zero-shot learning based approach for medieval word recognition using deep-learned features. In: 16th International Conference on Frontiers in Handwriting Recognition, ICFHR 2018, Niagara Falls, NY, USA, August 5-8, 2018. pp. 345–350. IEEE Computer Society (2018). https://doi.org/10.1109/ICFHR-2018.2018.00067
  • [4] Chanda, S., Haitink, D., Prasad, P.K., Baas, J., Pal, U., Schomaker, L.: Recognizing bengali word images - a zero-shot learning perspective. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 5603–5610 (2021). https://doi.org/10.1109/ICPR48806.2021.9412607
  • [5] Chen, J., Li, B., Xue, X.: Scene text telescope: Text-focused scene image super-resolution. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 12021–12030 (2021)
  • [6] Chen, J., Li, B., Xue, X.: Zero-shot chinese character recognition with stroke-level decomposition. In: IJCAI. pp. 615–621 (2021)
  • [7] Chen, J., Yu, H., Ma, J., Guan, M., Xu, X., Wang, X., Qu, S., Li, B., Xue, X.: Benchmarking chinese text recognition: Datasets, baselines, and an empirical study. arXiv preprint arXiv:2112.15093 (2021)
  • [8] Cheng, C., Xu, W., Bai, X., Feng, B., Liu, W.: Maximum entropy regularization and chinese text recognition. ArXiv abs/2007.04651 (2020)
  • [9] Ciresan, D.C., Meier, U.: Multi-column deep neural networks for offline handwritten chinese character classification. 2015 International Joint Conference on Neural Networks (IJCNN) pp. 1–6 (2015)
  • [10] Du, J., Wang, Z., Zhai, J.F., Hu, J.: Deep neural network based hidden markov model for offline handwritten chinese text recognition. 2016 23rd International Conference on Pattern Recognition (ICPR) pp. 3428–3433 (2016)
  • [11] Fang, S., Xie, H., Wang, Y., Mao, Z., Zhang, Y.: Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In: CVPR. pp. 7098–7107 (2021)
  • [12] He, S., Schomaker, L.: Open set chinese character recognition using multi-typed attributes. arXiv preprint arXiv:1808.08993 (2018)
  • [13] Li, H., Wang, P., Shen, C., Zhang, G.: Show, attend and read: A simple and strong baseline for irregular text recognition. In: AAAI. pp. 8610–8617 (2019)
  • [14] Luo, C., Jin, L., Sun, Z.: MORAN: A multi-object rectified attention network for scene text recognition. Pattern Recognition 90, 109–118 (2019)
  • [15] Qiao, Z., Zhou, Y., Yang, D., Zhou, Y., Wang, W.: SEED: semantics enhanced encoder-decoder framework for scene text recognition. In: CVPR. pp. 13525–13534 (2020)
  • [16] Rai, A., Krishnan, N.C., Chanda, S.: Pho(sc)net: An approach towards zero-shot word image recognition in historical documents. ArXiv abs/2105.15093 (2021)
  • [17] Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)
  • [18] Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2035–2048 (2019)
  • [19] Wang, Q.F., Yin, F., Liu, C.L.: Handwritten chinese text recognition by integrating multiple contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 1469–1481 (2012)
  • [20] Wang, T., Xie, Z., Li, Z., Jin, L., Chen, X.: Radical aggregation network for few-shot offline handwritten chinese character recognition. Pattern Recognit. Lett. 125, 821–827 (2019)
  • [21] Wang, W., Zhang, J., Du, J., Wang, Z., Zhu, Y.: Denseran for offline handwritten chinese character recognition. 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) pp. 104–109 (2018)
  • [22] Wang, Z., Du, J.: Joint architecture and knowledge distillation in cnn for chinese text recognition. Pattern Recognit. 111, 107722 (2019)
  • [23] Wang, Z., Du, J., Wang, J.: Writer-aware cnn for parsimonious hmm-based offline handwritten chinese text recognition. ArXiv abs/1812.09809 (2018)
  • [24] Wu, C.J., Wang, Z., Du, J., Zhang, J., Wang, J.: Joint spatial and radical analysis network for distorted chinese character recognition. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) 5, 122–127 (2019)
  • [25] Wu, C., Fan, W., He, Y., Sun, J., Naoi, S.: Handwritten character recognition by alternately trained relaxation convolutional neural network. 2014 14th International Conference on Frontiers in Handwriting Recognition pp. 291–296 (2014)
  • [26] Wu, Y., Hu, X.: From textline to paragraph: A promising practice for chinese text recognition. In: Proceedings of the Future Technologies Conference. pp. 618–633. Springer (2020)
  • [27] Xiao, X., Jin, L., Yang, Y., Yang, W., Sun, J., Chang, T.: Building fast and compact convolutional neural networks for offline handwritten chinese character recognition. Pattern Recognit. 72, 72–81 (2017)
  • [28] Xiao, Y., Meng, D., Lu, C., Tang, C.K.: Template-instance loss for offline handwritten chinese character recognition. 2019 International Conference on Document Analysis and Recognition (ICDAR) pp. 315–322 (2019)
  • [29] Yin, F., Wang, Q.F., Zhang, X.Y., Liu, C.L.: Icdar 2013 chinese handwriting recognition competition. 2013 12th International Conference on Document Analysis and Recognition pp. 1464–1470 (2013)
  • [30] Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., Ding, E.: Towards accurate scene text recognition with semantic reasoning networks. In: CVPR. pp. 12110–12119 (2020)
  • [31] Zhang, J., Zhu, Y., Du, J., Dai, L.: RAN: Radical analysis networks for zero-shot learning of chinese characters. ArXiv abs/1711.01889 (2017)