Utilizing Resource-Rich Language Datasets for End-to-End
Scene Text Recognition in Resource-Poor Languages
Abstract.
This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English, to train the resource-poor encoder-decoder model. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language’s dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
1. Introduction
Natural scene images contain a wealth of textual information, such as the text on store advertising signs and traffic signs. Scene text recognition is the task of recognizing the text in regions, such as signs, detected from natural scene images. Scene text recognition can be applied to many tasks such as image classification (Karaoglu et al., 2017a; Karaoglu et al., 2017b; Bai et al., 2018), image retrieval (Karaoglu et al., 2017a; Gómez et al., 2018), and visual question answering (Biten et al., 2019). Driven by these potential applications, research and development of scene text recognition technology is being actively carried out in both academic and industrial fields.
With improvements in deep learning technology, many scene text recognition methods have been proposed (Long et al., 2021). Most of the conventional methods are based on end-to-end neural networks. Typically, the input image is converted into continuous representations by using convolutional neural networks (CNNs) and bidirectional long short-term memory recurrent neural networks (BLSTM-RNNs). The obtained continuous representations are then subjected to connectionist temporal classification (CTC) (Graves et al., 2006), which yields a character string (Shi et al., 2016a; Liu et al., 2016; Wang and Hu, 2017; Borisyuk et al., 2018). Other methods use encoder-decoder models that rely on LSTMs and attention mechanisms instead of CTC for sequence-to-sequence conversion (Shi et al., 2016b; Lee and Osindero, 2016; Baek et al., 2019). In recent years, owing to improvements in deep learning technology for natural language processing, high recognition accuracy has been achieved by encoder-decoder models that utilize Transformer (Vaswani et al., 2017), in which sequence-to-sequence conversion is realized solely by attention mechanisms rather than by RNNs such as LSTMs (Sheng et al., 2019; Zhu et al., 2019; Wang et al., 2019; Yu et al., 2020; Bleeker and de Rijke, 2020; Lu et al., 2021). In these methods, the encoder extracts continuous representations that capture the features of the input image by using a CNN and a Transformer encoder, while the decoder translates the continuous representations into character strings by using a Transformer decoder.
To train a highly accurate end-to-end encoder-decoder scene text recognition model, a large image-to-text paired training dataset in the target language is required. While various well-prepared large public datasets are available for English (Mishra et al., 2012; Jaderberg et al., 2014; Gupta et al., 2016), there are few public datasets for minor languages such as Japanese. In fact, the public Japanese scene text dataset provided for the ICDAR2019 robust reading challenge on multilingual scene text detection and recognition (Nayef et al., 2019) contains a relatively small amount of Japanese data (on the order of 10K samples) compared to publicly available English datasets (on the order of 1M samples). Several methods have been proposed to synthesize training data (Gupta et al., 2016; Zhan et al., 2018; Long and Yao, 2020), but these methods require an appropriate corpus and images in the target language and the target domain to be prepared in advance; such preparations require expertise and are costly. Therefore, we need a training method that can yield an accurate scene text recognition model for the target language from a small dataset in the target language.
In this paper, we present a novel training method for an encoder-decoder scene text recognition model for resource-poor languages. Our method utilizes well-prepared large datasets in resource-rich languages such as English to train a scene text recognition model for a resource-poor target language. Our key idea is to build a model in which the encoder reflects the knowledge available in multiple languages for scene text recognition, including a variety of background images and character string shapes such as curved and tilted text, while the decoder specializes in just the knowledge of the resource-poor language. To this end, the proposed method pre-trains the encoder of the model by using a multilingual dataset, a combination of the resource-poor language's dataset and the resource-rich language's dataset, to learn language-invariant knowledge for scene text recognition. Our method also pre-trains the decoder of the model by using the resource-poor language's dataset to ensure that the decoder is specific to the resource-poor language. The proposed method finally fine-tunes the pre-trained encoder and decoder on the resource-poor language's dataset. Our method enables us to train the model efficiently without a large dataset in the resource-poor target language. Experiments on a small publicly available Japanese dataset (Nayef et al., 2019) and a large English dataset (Jaderberg et al., 2014) demonstrate the effectiveness of the proposed method.
Our contributions are summarized as follows:
- We provide a training method for an end-to-end encoder-decoder scene text recognition model for resource-poor languages that utilizes well-prepared large datasets in resource-rich languages effectively. To the best of our knowledge, while training methods utilizing multilingual data for end-to-end models have been proposed in the fields of speech and language processing (Adams et al., 2019; Lample and Conneau, 2019; Liu et al., 2020), ours is the first work to utilize multilingual data in training a scene text recognition model.
- We conduct experiments on Japanese scene text recognition using highly accurate Transformer-based scene text recognition models (Sheng et al., 2019; Wang et al., 2019), with a detailed ablation study that verifies the effectiveness of the proposed approach. The experiments show that even a small amount of resource-rich language data improves performance in the resource-poor language.
2. Transformer-Based Scene Text Recognition
This section describes scene text recognition based on Transformer (Sheng et al., 2019; Wang et al., 2019). Transformer (Vaswani et al., 2017) was originally proposed for machine translation, is based solely on attention mechanisms, and has been successful in various natural language processing tasks. In recent years, inspired by machine translation models, scene text recognition methods based on Transformer have been proposed (Sheng et al., 2019; Zhu et al., 2019; Wang et al., 2019; Yu et al., 2020; Bleeker and de Rijke, 2020; Lu et al., 2021). High recognition accuracy has been obtained thanks to the Transformer's powerful language modeling abilities.
Scene text recognition is a task that estimates a character string $W = \{w_1, \cdots, w_N\}$ from a character image $X$, where $w_n$ is the $n$-th character of the string and $N$ is the number of characters. In the auto-regressive encoder-decoder recognition model based on Transformer, the generation probability of $W$ from $X$ is modeled as

$$P(W \mid X; \Theta) = \prod_{n=1}^{N} P(w_n \mid w_{1:n-1}, X; \Theta), \tag{1}$$

where $w_{1:n-1} = \{w_1, \cdots, w_{n-1}\}$, and $\Theta$ represents the trainable model parameter set.
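To make the factorization in Eq. (1) concrete, here is a minimal sketch of greedy auto-regressive decoding. The `recognizer` interface, the start/end token ids, and the maximum length are hypothetical placeholders for illustration, not a prescribed implementation.

```python
import torch

@torch.no_grad()
def greedy_decode(recognizer, image, sos_id, eos_id, max_len=40):
    """Generate a character string one character at a time, following Eq. (1).

    `recognizer(image, prefix)` is assumed to return logits over the character
    vocabulary for every position of the prefix; this interface is hypothetical
    and only illustrates the auto-regressive factorization.
    """
    prefix = torch.tensor([[sos_id]])              # w_{1:0}: start token only
    for _ in range(max_len):
        logits = recognizer(image, prefix)         # models P(w_n | w_{1:n-1}, X; Theta)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        prefix = torch.cat([prefix, next_id], dim=1)
        if next_id.item() == eos_id:               # stop once end-of-sequence is produced
            break
    return prefix[0, 1:]                           # recognized character ids w_1 ... w_N
```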
Figure 1 shows network structures for a scene text recognition model based on Transformer (Sheng et al., 2019; Wang et al., 2019). As shown in Figure 1, the scene text recognition model based on Transformer consists of an encoder and a decoder.
2.1. Encoder
In the encoder, the input image $X$ is converted into continuous vectors as

$$\hat{H} = \mathrm{CNN}(X; \theta_{\mathrm{cnn}}), \tag{2}$$

$$\bar{H} = \mathrm{Reshape}(\hat{H}), \tag{3}$$

where $\mathrm{CNN}(\cdot)$ is a function that extracts image features by using a CNN-based model with parameters $\theta_{\mathrm{cnn}}$, $\mathrm{Reshape}(\cdot)$ is a function that translates the three-dimensional image features (height $\times$ width $\times$ channel) into two-dimensional vectors ($M \times$ channel), and $M$ is the width of the continuous vectors. The encoder proposed by Wang et al. (Wang et al., 2019) outputs continuous vectors $\bar{H}$, see Figure 1 (b). On the other hand, in the encoder proposed by Sheng et al. (Sheng et al., 2019) shown in Figure 1 (a), the continuous vectors $\bar{H}$ are then projected into $H^{(0)}$ for input to the Transformer encoder block as

$$H^{(0)} = \mathrm{AddPosition}(\bar{H}), \tag{4}$$

where $\mathrm{AddPosition}(\cdot)$ is a function that adds a continuous vector in which position information is embedded. The Transformer encoder composes continuous vectors $H^{(J)}$ from $H^{(0)}$ by using $J$ Transformer encoder blocks. The $j$-th Transformer encoder block forms the $j$-th continuous vectors $H^{(j)}$ from the lower layer inputs $H^{(j-1)}$ as

$$H^{(j)} = \mathrm{TransformerEncoderBlock}(H^{(j-1)}; \theta_{\mathrm{enc}}^{(j)}), \tag{5}$$

where $j \in \{1, \cdots, J\}$, and $\mathrm{TransformerEncoderBlock}(\cdot)$ is a Transformer encoder block with parameters $\theta_{\mathrm{enc}}^{(j)}$ that consists of a scaled dot product multi-head self-attention layer and a position-wise feed-forward network (Vaswani et al., 2017). The encoder proposed by Sheng et al. (Sheng et al., 2019) outputs continuous vectors $H^{(J)}$.
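The encoder computation in Eqs. (2)-(5) can be sketched in PyTorch as follows; the two-layer CNN is only a stand-in for the VGG16/ResNet34 backbones used later in the paper, the sinusoidal position encoding follows the standard Transformer recipe, and the hyperparameter defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn

def add_position(x):
    """Add sinusoidal position information to a (batch, length, dim) tensor (Eq. (4))."""
    _, length, dim = x.shape
    pos = torch.arange(length, dtype=torch.float32, device=x.device).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32, device=x.device)
                    * (-torch.log(torch.tensor(10000.0)) / dim))
    pe = torch.zeros(length, dim, device=x.device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return x + pe.unsqueeze(0)

class TransformerSceneTextEncoder(nn.Module):
    """Minimal sketch of Eqs. (2)-(5): CNN features are reshaped into a sequence,
    position information is added, and Transformer encoder blocks are applied.
    The tiny CNN stands in for the VGG16/ResNet34 backbones used in the paper."""

    def __init__(self, d_model=512, num_blocks=1, num_heads=4):
        super().__init__()
        self.cnn = nn.Sequential(                                  # CNN(.) in Eq. (2), simplified
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        block = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=512, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_blocks)     # Eq. (5), J blocks

    def forward(self, images):                                     # images: (B, 3, H, W)
        feats = self.cnn(images)                                   # Eq. (2): (B, d_model, H', W')
        seq = feats.flatten(2).transpose(1, 2)                     # Eq. (3): (B, M, d_model)
        return self.blocks(add_position(seq))                      # Eqs. (4)-(5): H^{(J)}
```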
2.2. Decoder
The decoder computes the generation probability of a character string from the preceding character string $w_{1:n-1}$ and the continuous vectors $H$ output from the encoder (i.e., $\bar{H}$ for Wang et al. or $H^{(J)}$ for Sheng et al.). The predicted probabilities of the $n$-th character, $w_n$, are calculated as

$$P(w_n \mid w_{1:n-1}, X; \Theta) = \mathrm{Softmax}(u_{n-1}^{(K)}; \theta_{\mathrm{out}}), \tag{6}$$

where $\mathrm{Softmax}(\cdot)$ is a softmax layer with linear transformation, parameterized by $\theta_{\mathrm{out}}$. The Transformer decoder forms the hidden representation $u_{n-1}^{(K)}$ from the encoder output $H$ by using $K$ Transformer decoder blocks, where

$$U_{1:n-1}^{(k)} = \{u_1^{(k)}, \cdots, u_{n-1}^{(k)}\}. \tag{7}$$

The $k$-th Transformer decoder block forms the $k$-th hidden representation $u_{n-1}^{(k)}$ from the lower layer inputs as

$$u_{n-1}^{(k)} = \mathrm{TransformerDecoderBlock}(U_{1:n-1}^{(k-1)}, H; \theta_{\mathrm{dec}}^{(k)}), \tag{8}$$

where $\mathrm{TransformerDecoderBlock}(\cdot)$ is a Transformer decoder block with parameters $\theta_{\mathrm{dec}}^{(k)}$ that consists of a scaled dot product multi-head self-attention layer, a scaled dot product multi-head source-target attention layer, and a position-wise feed-forward network (Vaswani et al., 2017). The hidden representation $u_{n-1}^{(0)}$ is given by

$$\bar{u}_{n-1} = \mathrm{Embedding}(w_{n-1}; \theta_{\mathrm{emb}}), \tag{9}$$

$$u_{n-1}^{(0)} = \mathrm{AddPosition}(\bar{u}_{n-1}), \tag{10}$$

where $\mathrm{Embedding}(\cdot)$ is a linear layer with parameters $\theta_{\mathrm{emb}}$ that embeds the input character into a continuous vector.
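A corresponding minimal sketch of the decoder in Eqs. (6)-(10), under the same assumptions as the encoder sketch (illustrative hyperparameters, standard sinusoidal position encoding via the `add_position` helper defined above, repeated here for self-containment):

```python
import torch
import torch.nn as nn

def add_position(x):
    """Add sinusoidal position information to a (batch, length, dim) tensor (Eq. (10))."""
    _, length, dim = x.shape
    pos = torch.arange(length, dtype=torch.float32, device=x.device).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32, device=x.device)
                    * (-torch.log(torch.tensor(10000.0)) / dim))
    pe = torch.zeros(length, dim, device=x.device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return x + pe.unsqueeze(0)

class TransformerSceneTextDecoder(nn.Module):
    """Minimal sketch of Eqs. (6)-(10): embed previous characters, add position
    information, apply Transformer decoder blocks that attend to the encoder
    output, and predict the next character with a linear + softmax layer."""

    def __init__(self, vocab_size, d_model=512, num_blocks=1, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # Embedding(.) in Eq. (9)
        block = nn.TransformerDecoderLayer(d_model, num_heads,
                                           dim_feedforward=512, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_blocks)     # Eq. (8), K blocks
        self.out = nn.Linear(d_model, vocab_size)                  # linear part of Eq. (6)

    def forward(self, prev_chars, enc_out):
        # prev_chars: (B, n-1) character ids; enc_out: (B, M, d_model) from the encoder
        u0 = add_position(self.embed(prev_chars))                  # Eqs. (9)-(10)
        n = prev_chars.size(1)
        causal = torch.triu(torch.full((n, n), float('-inf'),
                                       device=u0.device), diagonal=1)
        h = self.blocks(u0, enc_out, tgt_mask=causal)              # self- and source-target attention
        return self.out(h)                                         # softmax over this gives Eq. (6)
```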
2.3. Training
To train the model, the target language's dataset $\mathcal{D} = \{(X_1, W_1), \cdots, (X_{|\mathcal{D}|}, W_{|\mathcal{D}|})\}$ is used, where $|\mathcal{D}|$ is the number of data points. The optimization of the model parameters is represented as

$$\{\hat{\Theta}_{\mathrm{enc}}, \hat{\Theta}_{\mathrm{dec}}\} = \mathop{\mathrm{arg\,max}}_{\Theta_{\mathrm{enc}}, \Theta_{\mathrm{dec}}} \sum_{d=1}^{|\mathcal{D}|} \log P(W_d \mid X_d; \Theta_{\mathrm{enc}}, \Theta_{\mathrm{dec}}), \tag{11}$$

where $\Theta_{\mathrm{enc}}$ and $\Theta_{\mathrm{dec}}$ are the encoder and decoder parameter sets ($\Theta = \{\Theta_{\mathrm{enc}}, \Theta_{\mathrm{dec}}\}$), and $\hat{\Theta}_{\mathrm{enc}}$ and $\hat{\Theta}_{\mathrm{dec}}$ are the trained parameters.
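As a rough illustration of Eq. (11), the following sketch performs one maximum-likelihood update; maximizing the summed log-probability is implemented as minimizing cross-entropy. The batch-first target layout with start/end tokens and padding is an assumption for illustration.

```python
import torch.nn.functional as F

def training_step(encoder, decoder, optimizer, images, targets, pad_id):
    """One maximum-likelihood update for Eq. (11).

    `targets` are assumed to be (B, N+2) character ids of the form
    <sos> w_1 ... w_N <eos>, right-padded with `pad_id`."""
    enc_out = encoder(images)
    logits = decoder(targets[:, :-1], enc_out)        # predict w_n from w_{1:n-1}
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets[:, 1:].reshape(-1),
                           ignore_index=pad_id)       # padding positions are ignored
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```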
In this paper, to train an accurate scene text recognition model for the target language when a large dataset in the target language is not available, i.e., when the number of data points $|\mathcal{D}|$ for the target language is small, we propose a training method that utilizes not only the dataset of the resource-poor target language but also a well-prepared dataset of a resource-rich language.
3. Proposed Method
This section details the proposed method. The proposed method trains the encoder-decoder scene text recognition model based on Transformer described in Section 2 for the resource-poor target language.
The main idea of the proposed method is to utilize a well-prepared dataset of a resource-rich language such as English to train the recognition model for the resource-poor language. The proposed method does not simply train the model on a dataset composed of two languages; rather, it trains the resource-poor language's model by using efficient combinations of the resource-poor language's dataset and the resource-rich language's dataset. In detail, the proposed method first optimizes two models in pre-training, and then a part of each model is used as the pre-trained encoder and the pre-trained decoder for fine-tuning. For encoder pre-training, the proposed method utilizes the multilingual dataset composed of the resource-poor language's dataset and the resource-rich language's dataset. By this approach, the encoder can be trained on a larger dataset than would be possible with only the resource-poor language's dataset. Therefore, the encoder can learn image features that are not language specific, including a variety of backgrounds and character string shapes such as curved and tilted text, from the larger dataset, which effectively improves robustness. On the other hand, for decoder pre-training, the proposed method utilizes only the resource-poor language's dataset. This approach specializes the decoder for the resource-poor language. Since the decoder translates the image features captured by the encoder into character strings, recognition accuracy can be improved by specializing the decoder for the resource-poor language. Finally, the pre-trained encoder and decoder are fine-tuned by using the resource-poor language's dataset.
Figure 2 outlines the proposed method. The proposed method consists of two training phases: pre-training and fine-tuning.
[Figure 2. Overview of the proposed method: pre-training and fine-tuning.]
3.1. Pre-Training
In the pre-training phase, the proposed method uses not only the resource-poor language's dataset $\mathcal{D}^{\mathrm{poor}}$, but also a well-prepared large dataset in a resource-rich language, $\mathcal{D}^{\mathrm{rich}}$, where $|\mathcal{D}^{\mathrm{poor}}|$ and $|\mathcal{D}^{\mathrm{rich}}|$ are the number of data points in the resource-poor language's dataset and the resource-rich language's dataset, respectively. We assume $|\mathcal{D}^{\mathrm{poor}}| \ll |\mathcal{D}^{\mathrm{rich}}|$. The proposed method pre-trains the encoder and the decoder by using the multilingual dataset and the resource-poor language's dataset, respectively.
As for encoder pre-training, the multilingual model that consists of the multilingual encoder (ME) and the multilingual decoder (MD) is trained. Training uses the multilingual dataset made by combining the resource-poor language's dataset $\mathcal{D}^{\mathrm{poor}}$ with the resource-rich language's dataset $\mathcal{D}^{\mathrm{rich}}$. The optimization of the model parameters is given by

$$\{\hat{\Theta}_{\mathrm{ME}}, \hat{\Theta}_{\mathrm{MD}}\} = \mathop{\mathrm{arg\,max}}_{\Theta_{\mathrm{ME}}, \Theta_{\mathrm{MD}}} \sum_{(X, W) \in \mathcal{D}^{\mathrm{poor}} \cup \mathcal{D}^{\mathrm{rich}}} \log P(W \mid X; \Theta_{\mathrm{ME}}, \Theta_{\mathrm{MD}}), \tag{12}$$

where $\hat{\Theta}_{\mathrm{ME}}$ and $\hat{\Theta}_{\mathrm{MD}}$ are the trained parameters of ME and MD in encoder pre-training.
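A small sketch of how the multilingual training set for Eq. (12) could be assembled; the paper does not prescribe implementation details such as how the output character set of MD is built, so pooling the two datasets and taking the union of their character sets is one plausible choice, not a specified procedure.

```python
from torch.utils.data import ConcatDataset, DataLoader

def build_multilingual_loader(poor_dataset, rich_dataset, batch_size=32):
    """Pool the resource-poor (e.g. Japanese) and resource-rich (e.g. English)
    image-text pairs for encoder pre-training, i.e. D_poor ∪ D_rich in Eq. (12)."""
    pooled = ConcatDataset([poor_dataset, rich_dataset])
    return DataLoader(pooled, batch_size=batch_size, shuffle=True)

def build_joint_vocabulary(poor_texts, rich_texts):
    """One plausible joint character set for the multilingual decoder (MD):
    the union of all characters appearing in both languages' transcriptions."""
    chars = sorted(set("".join(poor_texts)) | set("".join(rich_texts)))
    return {c: i for i, c in enumerate(chars)}
```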
As for decoder pre-training, the resource-poor language model that consists of the resource-poor language encoder (RPLE) and the resource-poor language decoder (RPLD) is trained. Training uses the resource-poor language's dataset $\mathcal{D}^{\mathrm{poor}}$. The optimization of the model parameters is given by

$$\{\hat{\Theta}_{\mathrm{RPLE}}, \hat{\Theta}_{\mathrm{RPLD}}\} = \mathop{\mathrm{arg\,max}}_{\Theta_{\mathrm{RPLE}}, \Theta_{\mathrm{RPLD}}} \sum_{(X, W) \in \mathcal{D}^{\mathrm{poor}}} \log P(W \mid X; \Theta_{\mathrm{RPLE}}, \Theta_{\mathrm{RPLD}}), \tag{13}$$

where $\hat{\Theta}_{\mathrm{RPLE}}$ and $\hat{\Theta}_{\mathrm{RPLD}}$ are the trained parameters of RPLE and RPLD in decoder pre-training.
3.2. Fine-Tuning
In the fine-tuning phase, the proposed method trains the final recognition model using the pre-trained encoder-decoder parameters. Thus, the parameters $\hat{\Theta}_{\mathrm{ME}}$ of ME obtained by encoder pre-training and the parameters $\hat{\Theta}_{\mathrm{RPLD}}$ of RPLD obtained by decoder pre-training are used as the initial values for fine-tuning. Fine-tuning is carried out by using the resource-poor language's dataset $\mathcal{D}^{\mathrm{poor}}$. The optimization of the model parameters is given by

$$\{\hat{\Theta}_{\mathrm{enc}}^{\mathrm{ft}}, \hat{\Theta}_{\mathrm{dec}}^{\mathrm{ft}}\} = \mathop{\mathrm{arg\,max}}_{\Theta_{\mathrm{enc}}, \Theta_{\mathrm{dec}}} \sum_{(X, W) \in \mathcal{D}^{\mathrm{poor}}} \log P(W \mid X; \Theta_{\mathrm{enc}}, \Theta_{\mathrm{dec}}), \tag{14}$$

where $\hat{\Theta}_{\mathrm{enc}}^{\mathrm{ft}}$ and $\hat{\Theta}_{\mathrm{dec}}^{\mathrm{ft}}$ are the fine-tuned final parameters, and $\Theta_{\mathrm{enc}}$ and $\Theta_{\mathrm{dec}}$ are initialized with $\hat{\Theta}_{\mathrm{ME}}$ and $\hat{\Theta}_{\mathrm{RPLD}}$, respectively.
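Putting Eqs. (12)-(14) together, the whole procedure can be sketched as follows. `build_model` and `train` are assumed helpers: a factory returning a fresh Transformer encoder-decoder recognizer with `.encoder` and `.decoder` submodules, and a routine running maximum-likelihood training as in Eq. (11).

```python
def proposed_training(build_model, train, d_poor, d_multi):
    """Three-stage procedure of the proposed method (Section 3).

    `build_model()` is assumed to return a fresh encoder-decoder recognizer with
    `.encoder` and `.decoder` submodules; `train(model, dataset)` is assumed to
    run maximum-likelihood training as in Eq. (11)."""
    # (1) Encoder pre-training: ME + MD on the multilingual dataset (Eq. (12)).
    multilingual_model = build_model()
    train(multilingual_model, d_multi)

    # (2) Decoder pre-training: RPLE + RPLD on the resource-poor dataset (Eq. (13)).
    resource_poor_model = build_model()
    train(resource_poor_model, d_poor)

    # (3) Fine-tuning: initialize the encoder with ME and the decoder with RPLD,
    #     then train on the resource-poor dataset (Eq. (14)).
    final_model = build_model()
    final_model.encoder.load_state_dict(multilingual_model.encoder.state_dict())
    final_model.decoder.load_state_dict(resource_poor_model.decoder.state_dict())
    train(final_model, d_poor)
    return final_model
```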
In the experiments in Section 4, we additionally examined a training procedure that fine-tuned the parameters $\hat{\Theta}_{\mathrm{ME}}$ of ME with a randomly initialized decoder, and a training procedure that fine-tuned the parameters $\hat{\Theta}_{\mathrm{ME}}$ and $\hat{\Theta}_{\mathrm{MD}}$ of ME and MD.
4. Experiment
We conducted experiments to confirm the effectiveness of the proposed method. We selected Japanese as the resource-poor target language, for which no public large dataset is available. In addition to the limitation posed by the small available dataset, Japanese scene text recognition is further complicated by the diversity of characters. We selected English as the resource-rich language.
4.1. Datasets
As the resource-poor language's dataset $\mathcal{D}^{\mathrm{poor}}$, we used the Japanese data created for the ICDAR2019 robust reading challenge on multilingual scene text detection and recognition (Nayef et al., 2019); it is the only publicly available Japanese scene text dataset. The ICDAR2019 dataset holds annotated real and synthesized image data, created using the synthesizing method of (Gupta et al., 2016), for end-to-end scene text detection and recognition in 10 languages. To construct the Japanese scene text recognition dataset for this experiment, we first selected the Japanese real image data and synthesized image data. Then, we cropped the images according to the annotations for scene text detection and excluded data that contained characters other than standard Japanese characters. This yielded 9,346 real images and 65,452 synthesized images. Mixing these images and splitting them at a ratio of roughly 9:1 yielded training data of 67,368 images and test data of 7,430 images. The training data was used as $\mathcal{D}^{\mathrm{poor}}$ for both pre-training and fine-tuning. There were 2,332 character classes.
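For illustration only, a sketch of the cropping and filtering step described above; the annotation format and the exact definition of "standard Japanese characters" are not specified here, so the box representation and the Unicode ranges below are assumptions.

```python
import re
from PIL import Image

# Hypothetical character filter: hiragana, katakana, CJK ideographs, ASCII
# alphanumerics, and common punctuation. The exact definition of "standard
# Japanese characters" is an assumption, not taken from the paper.
STANDARD_JA = re.compile(
    r'^[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF0-9A-Za-z\u3000-\u303F!-/:-@]+$')

def crop_and_filter(image_path, word_annotations):
    """Crop word regions using the scene text detection annotation and drop
    samples whose transcription contains out-of-scope characters.
    `word_annotations` is assumed to be a list of ((left, top, right, bottom), text)."""
    image = Image.open(image_path).convert('RGB')
    samples = []
    for (left, top, right, bottom), text in word_annotations:
        if STANDARD_JA.match(text):
            samples.append((image.crop((left, top, right, bottom)), text))
    return samples
```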
We used MJSynth (Jaderberg et al., 2014) as the resource-rich language's (i.e., English) dataset $\mathcal{D}^{\mathrm{rich}}$. The number of training images was 8,027,346.
Table 1. Recognition accuracy of each model and training procedure on the Japanese test data.

| Modeling | Training Procedure | Encoder Pre-Training | Decoder Pre-Training | Accuracy (%) |
|---|---|---|---|---|
| Shi et al. (Shi et al., 2016a) | Baseline | – | – | 27.82 |
| Borisyuk et al. (Borisyuk et al., 2018) | Baseline | – | – | 29.73 |
| Shi et al. (Shi et al., 2016b) | Baseline | – | – | 30.70 |
| Lee et al. (Lee and Osindero, 2016) | Baseline | – | – | 47.85 |
| Wang et al. (Wang and Hu, 2017) | Baseline | – | – | 48.05 |
| Liu et al. (Liu et al., 2016) | Baseline | – | – | 54.48 |
| Baek et al. (Baek et al., 2019) | Baseline | – | – | 55.34 |
| Sheng et al. (Sheng et al., 2019) | Baseline | – | – | 47.85 |
| Sheng et al. (Sheng et al., 2019) | Training w/ ME | Multilingual | – | 59.30 |
| Sheng et al. (Sheng et al., 2019) | Training w/ ME+MD | Multilingual | Multilingual | 59.43 |
| Sheng et al. (Sheng et al., 2019) | Proposed training w/ ME+RPLD | Multilingual | Japanese | 62.14 |
| Wang et al. (Wang et al., 2019) | Baseline | – | – | 65.22 |
| Wang et al. (Wang et al., 2019) | Training w/ ME | Multilingual | – | 69.58 |
| Wang et al. (Wang et al., 2019) | Training w/ ME+MD | Multilingual | Multilingual | 69.33 |
| Wang et al. (Wang et al., 2019) | Proposed training w/ ME+RPLD | Multilingual | Japanese | 72.57 |
4.2. Setups
We tested the following four training procedures to evaluate the proposed training method. Baseline did not pre-train the recognition model; it was trained from scratch using only the Japanese dataset. In Training w/ ME, we pre-trained ME by using the multilingual dataset made by combining the Japanese dataset and the English dataset, and fine-tuned ME with a randomly initialized decoder by using the Japanese dataset. In Training w/ ME+MD, we pre-trained ME and MD by using the multilingual dataset, and fine-tuned them by using the Japanese dataset. In Proposed training w/ ME+RPLD, we pre-trained ME and RPLD by using the multilingual dataset and the Japanese dataset, respectively, and fine-tuned them by using the Japanese dataset.
Two recognition models based on Transformer were evaluated. The first is the model proposed by Sheng et al. (Sheng et al., 2019); VGG16 (Simonyan and Zisserman, 2015) up to the 10th convolution layer was applied as the CNN feature extractor. The second is the model proposed by Wang et al. (Wang et al., 2019); ResNet34 (He et al., 2016) was applied as the CNN feature extractor. The Transformer blocks were composed under the following conditions: the number of Transformer encoder blocks and Transformer decoder blocks was set to 1, the dimensions of the output continuous representations and of the inner outputs in the position-wise feed-forward networks were set to 512, and the number of heads in the multi-head attention layers was set to 4. We also evaluated the models based on RNNs (Shi et al., 2016a; Liu et al., 2016; Wang and Hu, 2017; Borisyuk et al., 2018; Shi et al., 2016b; Lee and Osindero, 2016; Baek et al., 2019). For the models based on RNNs, we evaluated only the baseline training procedure, and the model structures followed the evaluation by Baek et al. (Baek et al., 2019).
For all evaluated models, the input images were scaled to a fixed input resolution. For training, we used a stochastic gradient descent (SGD) optimizer with a learning rate of 0.01, and the mini-batch size was set to 32 images. Note that part of the training data was held out for early stopping. As the evaluation metric, we used recognition accuracy determined by exact matching.
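The experimental configuration can be summarized as the following sketch, together with the exact-match accuracy metric; the dictionary keys are illustrative names rather than the authors' code.

```python
# Training and model configuration used in the experiments (illustrative names).
CONFIG = {
    "num_encoder_blocks": 1,
    "num_decoder_blocks": 1,
    "d_model": 512,          # dimension of the output continuous representations
    "ffn_inner_dim": 512,    # inner dimension of the position-wise feed-forward networks
    "num_heads": 4,          # heads in the multi-head attention layers
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "batch_size": 32,
}

def exact_match_accuracy(predictions, references):
    """Recognition accuracy by exact matching: a prediction is correct only if
    the whole recognized string equals the reference string."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```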
4.3. Results
The recognition accuracy values are shown in Table 1. First, with limited training data, the recognition models based on Transformer achieved the same or higher Japanese recognition accuracy than the recognition models based on RNNs. Next, for the models based on Transformer, the baseline yielded lower accuracy than any procedure that used pre-training. In detail, training with ME or ME+MD brought only a small performance improvement, whereas the proposed training with ME+RPLD further improved the recognition accuracy. Examples of recognition results obtained by the model of Wang et al. (Wang et al., 2019) are shown in Figures 3 and 4. Note that the captions for each image in Figures 3 and 4 are (a) ground truth and the recognition results of (b) baseline, (c) training w/ ME, (d) training w/ ME+MD, and (e) proposed training w/ ME+RPLD. The results in Figure 3 show that the proposed training method can prevent the erroneous recognition of words in tilted or blurred images; this is mainly the effect of utilizing the multilingual dataset in pre-training the encoder. The results in Figure 4 show that the proposed training method can prevent the false recognition of nonexistent words in relatively long character strings, which is mainly due to pre-training the decoder by utilizing the resource-poor language's dataset. These results confirm that the proposed training method is an effective way of improving recognition accuracy when the image-to-text paired dataset in the resource-poor language is limited.
We also evaluated the recognition accuracy while varying the size of the English dataset used in the pre-training phase. We prepared English datasets of four sizes: 8M as used in the above evaluation, 4M (4,013,673 images), 2M (2,006,837 images), and 1M (1,076,675 images). The resulting accuracy values are shown in Table 2. They show that increasing the data size improves the recognition accuracy. On the other hand, the proposed method is more effective than the baseline even when the size of the English dataset is small.
5. Conclusion
This paper proposed a novel training method for an encoder-decoder scene text recognition model for resource-poor languages. The key advance of our method is to utilize a well-prepared large dataset in a resource-rich language and pre-train the encoder of the recognition model by using a multilingual dataset to capture image features that are not language specific. Our method also pre-trains the decoder of the recognition model by using the resource-poor language's dataset to ensure its suitability for the resource-poor language. This achieves accurate recognition in the resource-poor language even though the dataset for the resource-poor language is limited. Japanese scene text recognition experiments using a small, publicly available Japanese dataset confirmed the improvement in recognition accuracy offered by the proposed method.
References
- Adams et al. (2019) Oliver Adams, Matthew Wiesner, Shinji Watanabe, and David Yarowsky. 2019. Massively Multilingual Adversarial Speech Recognition. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 96–108.
- Baek et al. (2019) Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 4714–4722.
- Bai et al. (2018) Xiang Bai, Mingkun Yang, Pengyuan Lyu, Yongchao Xu, and Jiebo Luo. 2018. Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification. IEEE Access 6 (2018), 66322–66335.
- Biten et al. (2019) Ali Furkan Biten, Rubén Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. 2019. Scene Text Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 4290–4300.
- Bleeker and de Rijke (2020) Maurits Bleeker and Maarten de Rijke. 2020. Bidirectional Scene Text Recognition with a Single Decoder. In Proceedings of the European Conference on Artificial Intelligence (ECAI). 2664–2671.
- Borisyuk et al. (2018) Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. 2018. Rosetta: Large Scale System for Text Detection and Recognition in Images. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 71–79.
- Gómez et al. (2018) Lluís Gómez, Andrés Mafla, Marçal Rusiñol, and Dimosthenis Karatzas. 2018. Single Shot Scene Text Retrieval. In Proceedings of the European Conference on Computer Vision (ECCV). 700–715.
- Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML). 369–376.
- Gupta et al. (2016) Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic Data for Text Localisation in Natural Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2315–2324.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
- Jaderberg et al. (2014) Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. In Proceedings of the Workshop on Deep Learning, NIPS.
- Karaoglu et al. (2017a) Sezer Karaoglu, Ran Tao, Theo Gevers, and Arnold W. M. Smeulders. 2017a. Words Matter: Scene Text for Image Classification and Retrieval. IEEE Transactions on Multimedia 19, 5 (2017), 1063–1076.
- Karaoglu et al. (2017b) Sezer Karaoglu, Ran Tao, Jan C. van Gemert, and Theo Gevers. 2017b. Con-Text: Text Detection for Fine-Grained Object Classification. IEEE Transactions on Image Processing 26, 8 (2017), 3965–3980.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems (NeurIPS). 7059–7069.
- Lee and Osindero (2016) Chen-Yu Lee and Simon Osindero. 2016. Recursive Recurrent Nets with Attention Modeling for OCR in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2231–2239.
- Liu et al. (2016) Wei Liu, Chaofeng Chen, Kwan-Yee K. Wong, Zhizhong Su, and Junyu Han. 2016. STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition. In Proceedings of the British Machine Vision Conference (BMVC).
- Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. Proceedings of The Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020).
- Long et al. (2021) Shangbang Long, Xin He, and Cong Yao. 2021. Scene Text Detection and Recognition: The Deep Learning Era. International Journal of Computer Vision 129, 1 (2021), 161–184.
- Long and Yao (2020) Shangbang Long and Cong Yao. 2020. UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5488–5497.
- Lu et al. (2021) Ning Lu, Wenwen Yu, Xianbiao Qi, Yihao Chen, Ping Gong, Rong Xiao, and Xiang Bai. 2021. MASTER: Multi-Aspect Non-local Network for Scene Text Recognition. Pattern Recognition 117 (2021), 107980.
- Mishra et al. (2012) Anand Mishra, Karteek Alahari, and C.V. Jawahar. 2012. Scene Text Recognition using Higher Order Language Priors. In Proceedings of the British Machine Vision Conference (BMVC).
- Nayef et al. (2019) Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-Lin Liu, and Jean-Marc Ogier. 2019. ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition. In Proceedings of the IEEE International Conference on Document Analysis and Recognition (ICDAR). 1582–1587.
- Sheng et al. (2019) Fenfen Sheng, Zhineng Chen, and Bo Xu. 2019. NRTR: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition. In Proceedings of the IEEE International Conference on Document Analysis and Recognition (ICDAR). 781–786.
- Shi et al. (2016a) Baoguang Shi, Xiang Bai, and Cong Yao. 2016a. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 11 (2016), 2298–2304.
- Shi et al. (2016b) Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. 2016b. Robust Scene Text Recognition with Automatic Rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4168–4176.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NIPS). 5998–6008.
- Wang and Hu (2017) Jianfeng Wang and Xiaolin Hu. 2017. Gated Recurrent Convolution Neural Network for OCR. In Advances in Neural Information Processing Systems (NIPS). 334–343.
- Wang et al. (2019) Peng Wang, Lu Yang, Hui Li, Yuyan Deng, Chunhua Shen, and Yanning Zhang. 2019. A Simple and Robust Convolutional-Attention Network for Irregular Text Recognition. arXiv preprint arXiv:1904.01375 (2019).
- Yu et al. (2020) Deli Yu, Xuan Li, Chengquan Zhang, Junyu Han, Jingtuo Liu, and Errui Ding. 2020. Towards Accurate Scene Text Recognition With Semantic Reasoning Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 12113–12122.
- Zhan et al. (2018) Fangneng Zhan, Shijian Lu, and Chuhui Xue. 2018. Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes. In Proceedings of the European Conference on Computer Vision (ECCV). 249–266.
- Zhu et al. (2019) Yiwei Zhu, Shilin Wang, Zheng Huang, and Kai Chen. 2019. Text Recognition in Images Based on Transformer with Hierarchical Attention. In Proceedings of the IEEE International Conference on Image Processing (ICIP). 1945–1949.