
Towards Robust Handwritten Text Recognition with On-the-fly User Participation

Ajoy Mondal, CVIT, IIIT Hyderabad, India, [email protected]; Rohit Saluja, CVIT, IIIT Hyderabad, India, [email protected]; and C. V. Jawahar, CVIT Lab, IIIT Hyderabad, India, [email protected]
(2022)
Abstract.

Long-term OCR services aim to provide high-quality output to their users at competitive costs. It is essential to upgrade the models because of the complex data loaded by the users. The service providers encourage the users who provide data on which the OCR model fails by rewarding them based on data complexity, readability, and the available budget. Hitherto, OCR works have prepared models on standard datasets without considering the end-users. We propose a strategy of consistently upgrading an existing Handwritten Hindi OCR model three times on the dataset of 15 users. We fix the budget at 4 users for each iteration. For the first iteration, the model trains directly on the dataset from the first four users. For the remaining iterations, all remaining users write a page each, which the service providers later analyze to select the 4 (new) best users based on the quality of predictions on the human-readable words. Selected users write 23 more pages for upgrading the model. We upgrade the model with Curriculum Learning (CL) on the data available in the current iteration and a proportionate subset from previous iterations. The upgraded model is tested on a held-out set of one page each from all 15 users. We provide insights into our investigations on the effect of CL, user selection, and especially the data from unseen writing styles. Our work can be used for long-term OCR services in crowd-sourcing scenarios for the service providers and the end users.

OCR, service, handwritten, Hindi, robust, curriculum learning.
copyright: rights retained (ACM copyright); isbn: 978-1-4503-9822-0/22/12; conference: 13th Indian Conference on Computer Vision, Graphics and Image Processing, December 2022, Gandhinagar, India; booktitle: Proceedings of the Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP'22), December 8–10, 2022, Gandhinagar, India; journal year: 2022; price: 15.00; doi: 10.1145/3571600.3571613; article: 13; ccs: Applied computing, Optical character recognition
Figure 1. Top: pipeline showing dissatisfied users' experience when the ocr service providers and users work in individual silos. Bottom: collaborative pipeline motivating the importance of a satisfactory collective experience with on-the-fly user participation in regularly upgrading the ocr model.

1. Introduction

Optical Character Recognition (ocr) is the electronic conversion of printed or handwritten document images into a machine-readable form. ocr is an essential component of document image analysis. Typically, an ocr system includes two main modules: (i) a text detection module and (ii) a text recognition module. A text detection module aims to localize all text blocks within the image, either at the word or line level. In contrast, the text recognition module strives to understand the text image content and transcribe the visual signals into natural language tokens. The problem of handwritten text recognition is more exciting and challenging than printed text recognition due to the uneven variations in handwriting style across writers, content, and time. A person's handwriting is always unique, and this unique property creates motivation and interest among researchers to work in this imperative and challenging field.

In the field of English handwritten text recognition, several methods have been proposed: (i) methods (Chaudhary and Bali, 2021; Ptucha et al., 2019; Yousef et al., 2020) based only on convolutional neural networks, (ii) methods (Kang et al., 2022; Diaz et al., 2021; Li et al., 2021) based on transformer networks, and (iii) methods (Wigington et al., 2017; Bluche and Messina, 2017; Michael et al., 2019; Kang et al., 2018; Bluche, 2016; Chowdhury and Vig, 2018; Sueiras et al., 2018) based on cnn-rnn architectures. Among the almost 7,000 languages around the world, ocr systems are mostly available for languages of huge importance and strong economic value like English (Graves and Schmidhuber, 2008; Pham et al., 2014; Li et al., 2021), Chinese (Xie et al., 2016; Wu et al., 2017; Peng et al., 2022), Arabic (Maalej and Kherallah, 2020; Jemni et al., 2022), and Japanese (Ly et al., 2018; Nguyen et al., 2020). Among the 22 Indic languages, most of the languages derived from Indic scripts appear to be at risk of vanishing due to the absence of efforts to preserve them. In many Indic scripts, two or more characters often combine to form conjuncts, considerably increasing the vocabulary to be tackled by ocr systems (cit, 2020). These inherent features of Indic scripts make Handwritten Recognizers (hwr) more challenging than those for Latin scripts. In contrast to the 52 unique (upper case and lower case) characters in English, most Indic scripts have over 100 unique basic Unicode characters (Pal and Chaudhuri, 2004). Several methods offer solutions to Indic handwritten text recognition tasks. These methods can be categorised into (i) segmentation-free but lexicon-dependent methods (Shaw et al., 2008, 2014; Kaur and Kumar, 2021), (ii) segmentation-dependent methods (Arora et al., 2010; Labani et al., 2018; Alonso-Weber et al., 2014; Roy et al., 2016), and (iii) sequence-to-sequence, i.e., cnn-rnn methods (Adak et al., 2016; Dutta et al., 2017, 2018; Gongidi and Jawahar, 2021).

All the ocr systems mentioned above for English and Indian languages perform well when the test set's distribution is similar to the data used to train them. When the test set's distribution deviates from the training set, the performance of these ocr systems drops drastically. This situation often arises in real-world scenarios and often leads to a dissatisfied user experience, as shown at the top of Fig. 1. In this work, we propose a controlled framework for long-term ocr services providing high-quality output to their users at fixed budgets. Robust ocr systems must update the models that fail on the complex documents uploaded by the users. As Fig. 1 (bottom) depicts, regularly upgrading the model collaboratively on users' data and standard datasets can help achieve a satisfactory experience for users as well as service providers. The process also involves user selection based on the available budget for annotation and rewarding the users for the data. We expressly assume that serving a large number of users, say N, is the end goal of the ocr system. In our setting, we have N_1 = 4 users available at the first iteration of upgrading the model, N_2 = 10 at the second, and N_3 = 15 at the third iteration. We assume that N_3 represents N to investigate the subject matter in a controlled setting. We fix the budget at m = 4 users for each iteration. At the start of the second (and third) iteration, all the new users write a page, which the experts or service providers analyze to select the m (new) best users based on the quality of ocr predictions on the human-readable words. Each of the selected m users (or the fixed m users for iteration 1) writes 23 more pages for training the model. We upgrade the model in each iteration with Curriculum Learning (cl) and test it on the held-out set consisting of one page each from all the users (N_3). We provide an analysis of the effect of cl, user selection, and the performance of our system on unseen writing styles. The key contributions of this work are as follows:

  • Our setup is unique compared to previous ocr works, which do not include users' satisfaction or participation. We aim to solve the problem in a real-life scenario, where the users of an ocr service grow continuously and the service provider tries to satisfy them within a limited budget.

  • Our experiments show that i) cl can be effective in continuously upgrading the ocr model on users' data, ii) user selection based on ocr predictions on the human-readable portion of users' data can improve results over other selections with a fixed budget, and iii) observations on data from users the model has never seen (or been trained on) can help quantify the robustness of the models.

2. Related Work

OCR Services: Tesseract (Tesseract, 2022) is an open-source ocr system that works for over 100 languages. The first three versions of Tesseract ocr work on the principle of classical machine learning techniques. A Gaussian Mixture Model (GMM) based classifier identifies characters from features based on the vectors obtained from text boundaries. The subsequent versions of Tesseract involve line-level deep models. Google Docs ocr works for over 245 languages (Google, 2020). Tafti et al. (Tafti et al., 2016) perform a qualitative and quantitative analysis of Google Docs ocr, Tesseract, ABBYY FineReader, and Transym using 1,227 images from 15 categories like printed, handwritten, noisy, multi-oriented, and multi-lingual images. While most services work well on printed text, many fail on handwritten and noisy text images. While some commercial service providers might be using the users' data to improve their models, none of the ocr services mentioned above provide a transparent upgradation process based on users' inputs like ours, with a focus on improving the overall user experience.

Handwritten Text Recognition: In the space of English handwritten text recognition, a few works have used entirely convolutional neural networks without any recurrent architectures (Chaudhary and Bali, 2021; Ptucha et al., 2019; Yousef et al., 2020). A few recent networks (Ingle et al., 2019; Coquenet et al., 2020) use gating mechanisms in cnns, known as gated Convolutional Neural Networks (gcn), to compensate for the dependency on lstm cells. These networks outperform fully convolutional networks, yet they lag behind rnn/Transformer-based ocr models (Kang et al., 2022; Diaz et al., 2021; Li et al., 2021). Recurrent Neural Networks (rnns) have been successfully applied to Handwritten Text Recognition (htr) tasks. LSTM-based models can handle long-term context in sequences. The most common architectures (Wigington et al., 2017; Bluche and Messina, 2017) are combinations of cnn and rnn, where the cnn extracts features from word or line images and the rnn models the sequential context. Several works (Michael et al., 2019; Kang et al., 2018; Bluche, 2016; Chowdhury and Vig, 2018; Sueiras et al., 2018) use various attention mechanisms to improve the performance of cnn+lstm models. Recently, Transformer-based text recognizers (Kang et al., 2022; Diaz et al., 2021; Li et al., 2021) achieved state-of-the-art performance. Some of these works use a cnn-based backbone with self-attention as encoders to understand document images (Li et al., 2021).

Officially, there are 22 languages in India, many of which are used only for communication purposes. Among these languages, Hindi, Bangla, and Telugu are the top three in terms of the percentage of native speakers (Krishnan and Jawahar, 2019). In many Indic scripts, two or more characters often combine to form conjuncts, considerably increasing the vocabulary to be tackled by ocr systems (cit, 2020). These inherent features of Indic scripts make Handwritten Recognizers (hwr) more challenging than those for Latin scripts. Compared to the 52 unique (upper case and lower case) characters in English, most Indic scripts have over 100 unique basic Unicode characters (Pal and Chaudhuri, 2004).

Three popular ways of building handwritten word recognizers for Indic scripts are available in the literature. The first is segmentation-free but lexicon-dependent methods, which train on representations of whole words (Shaw et al., 2008, 2014; Kaur and Kumar, 2021). Shaw et al. (Shaw et al., 2008) represent word images using a histogram of chain-code directions in image strips, scanned from left to right by a sliding window, as the feature vector. A continuous density Hidden Markov Model (hmm) recognizes handwritten Devanagari words. Shaw et al. (Shaw et al., 2014) discuss a novel combination of two different feature vectors for holistic recognition of offline handwritten word images in the same direction. The second category of approaches involves segmentation of the characters within the word images and recognition of the isolated characters using an isolated symbol classifier such as a Support Vector Machine (svm) (Arora et al., 2010) or an Artificial Neural Network (ann) (Labani et al., 2018; Alonso-Weber et al., 2014). Roy et al. (Roy et al., 2016) segment Bengali and Devanagari word images into the upper, middle, and lower zones using morphology and shape matching. The symbols in the upper and lower zones are recognized using an svm, while a Hidden Markov Model (hmm) recognizes the characters in the middle zone. Finally, the results from all three zones are combined. This category of approaches suffers from the drawback of using an error-prone, script-dependent character segmentation algorithm. The third category of approaches treats word recognition as a sequence-to-sequence prediction task where both the input and output are treated as vector sequences. The aim is to maximize the probability of predicting the output label sequence given the input feature sequence (Adak et al., 2016; Dutta et al., 2017, 2018). Garain et al. (Garain et al., 2015) propose a recognizer using Bidirectional Long Short-Term Memory (blstm) with a Connectionist Temporal Classification (ctc) layer to recognize unconstrained offline handwritten Bengali words. Adak et al. (Adak et al., 2016) use a Convolutional Neural Network (cnn) integrated with an lstm and a ctc layer to recognize Bengali handwritten words. In the same direction, Dutta et al. propose cnn-rnn hybrid end-to-end models to recognize Devanagari, Bengali (Dutta et al., 2017), and Telugu (Dutta et al., 2018) handwritten words. In one of the works by Gongidi and Jawahar (Gongidi and Jawahar, 2021), the authors use a Spatial Transformer Network along with a hybrid cnn-rnn and a ctc layer to recognize word images in eight different Indic scripts: Bengali, Gurumukhi, Gujarati, Odia, Kannada, Malayalam, Tamil, and Urdu. The authors use various data augmentation functions to improve recognition accuracy. This category of methods does not require character-level segmentation and is not bound to recognizing a limited set of words.

Curriculum Learning: Curriculum Learning (cl) methods are commonly used for computer vision tasks like object recognition (Hacohen and Weinshall, 2019; Mousavi et al., 2021; Wang et al., 2019) and object detection (Singh et al., [n.d.]; Goyal et al., 2022; Wang et al., 2018). The works mentioned above use cl to handle intra-class scale variations, inter-class confusions, and the challenges involved in weakly-supervised or semi-supervised training. Our method uses cl to train the handwriting recognition ocr model collaboratively on the service provider's datasets and the continuously updated dataset from the growing number of users.

Figure 2. The number of users contributing to the ocr service grows with the iterations of upgrading the model. Top: the test set is fixed initially for N_3 users (approximating a large value N). Middle: the validation set increases linearly with the number of users. Bottom: the contribution of users to the training set grows relatively slower than the validation set across iterations, depending on the budget.
Figure 3. The pipeline of our framework. Top: Users write the editable text shown on the screen. Bottom-right: annotators label the boxes, and the original text is aligned with the boxes to get word images and labels (shown in different colors). Bottom (middle and left): experts discard the non-readable images, and the ocr model is upgraded using the standard and users’ datasets.

3. Methodology

This section describes the assumptions and the pipeline of our framework. Consider the problem of updating an ocr service on the complex data provided by different users. The number of users generally grows in such scenarios. The end goal of the ocr service is to provide a satisfactory experience to a large number of users, which we refer to as N. However, in practice, most users cannot contribute much data; they can contribute only a few (one or two) pages. Some regular users will naturally contribute more data for upgrading the model at arbitrary intervals, since such users are also interested in repeatedly utilizing the model for their documents. The overall situation can be complex in real-life scenarios: the number of users can grow arbitrarily, and the budget of the service providers and the number of pages shared by each user can also vary with time. Hence, we make the following assumptions to investigate the work in a controlled setting, which we discuss in the next subsection.

3.1. Setting

We define the controlled setting we work with as follows:

  • As shown by the black arrow at the top of Fig. 2, we assume that a limited number of users, i.e., N_3 = 15, approximates the desired value N, owing to the limited participants we have. The goal is to develop a robust model for N_3 users by using training data from fewer than N_3 users, depending on the fixed budget m = 4 for each of the three training (or upgrading) iterations.

  • We fix the test set as one page written by each of the N_3 users, as shown in Fig. 2 (top) in pink.

  • As Fig. 2 depicts, we upgrade the models at a fixed set of intervals, i.e., after receiving the data from N_1 = 4, N_2 = 10, and N_3 = 15 users, respectively, which we refer to as three iterations.

  • As Fig. 2 (middle) depicts in purple, the validation dataset grows with the training iterations. The validation set includes one page from each user who is available (partly or completely) until the iteration ends.

  • The latest model from the previous iteration is evaluated on the validation data. Based on the ocr performance on each user's validation page, the experts select m = 4 new users for the current iteration. Here, m is the (fixed) budget signifying the number of users per iteration. We discuss the selection criteria used by experts in the following subsections.

  • Each of the selected m users writes 23 more pages, as shown in Fig. 2 (middle) in green. The model is trained with Curriculum Learning (cl) on the dataset from the users selected in the current iteration and equal proportions of data from previous iterations. Since the data from users selected in previous iterations is already available, those users do not contribute to the training data in the current iteration again.

We discuss the entire pipeline of our system along with the user selection criteria used by the experts in the following subsection.

3.2. Pipeline

In Fig. 3, we illustrate the details of training our ocr model with Curriculum Learning (cl) on a standard dataset and the dataset collected from the users (or writers). We now discuss the collaborative training process in different iterations.

Figure 4. Hard-but-readable (left) and non-human-readable (right) samples. Each sample is followed by its ground truth and prediction (incorrect characters in red, correct in green). The experts' reasons are shown below the dotted boxes.

Collaborative Training in First Iteration: In the first iteration of our process, the N_1 users write the editable text provided to them, as shown at the top of Fig. 3. As the bottom-right part of the figure depicts, the handwritten documents are then passed to the annotators, who add word-level bounding boxes over the images and align the editable text with the bounding boxes. Next, experts look at the word images and the corresponding text and discard the non-readable words. The discarded samples are shown in the bottom-middle of Fig. 3. Finally, as the bottom-left of Fig. 3 depicts, we update the ocr model with cl on the clean users' data obtained after expert filtering and an equal number of word-level samples from the standard dataset.
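The alignment and filtering step can be sketched as follows. This is a toy Python sketch, assuming annotators draw one box per written word in reading order and that expert-rejected word indices are available; the function name and data structures are illustrative, not the authors' annotation tooling.

    def align_and_filter(source_text, boxes, non_readable):
        """Pair each annotated box with its source word; drop rejected indices."""
        words = source_text.split()
        assert len(words) == len(boxes), "annotators draw one box per written word"
        return [(box, word)
                for i, (box, word) in enumerate(zip(boxes, words))
                if i not in non_readable]

    # Example: experts flagged the third written word as non-readable.
    samples = align_and_filter(
        "पहला दूसरा तीसरा",
        [(0, 0, 40, 20), (45, 0, 90, 20), (95, 0, 140, 20)],
        non_readable={2},
    )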

Figure 5. Scatter chart: Character Recognition Rate vs. Average Character Confidence of the Iteration 1 model evaluated on the test sets from different users. Character confidence remains high (> 98.4) despite varying recognition rates.

Collaborative Training in Subsequent Iterations: As discussed in the previous subsection, for iterations t ∈ [2, 3], the validation set provided by each new user is analyzed by the experts to finalize the m best users. In particular, the users with the worst ocr accuracy on human-readable samples are selected, so that their data can contribute more to robustness. There is a thin line between hard readable examples and non-human-readable examples. Hence, the experts also give reasons for classifying different samples into readable, hard (but readable), and non-human-readable categories. Hard readable samples from different iterations are shown in Fig. 4 (left), along with the experts' reasons (for marking them hard) below the dotted blue boxes. Generally, hard samples involve extreme noise and blur introduced while scanning document images with a phone camera, glyphs rewritten over others, and overlapping word or character components. Fig. 4 (right) illustrates sample non-human-readable words discarded by the experts. As shown in the text below the dotted boxes, non-readable samples involve clear spelling mistakes and bad handwriting, which generally lead to Out-Of-Vocabulary (OOV) word predictions. For user selection, experts also analyze the average character confidence over the data from different users. However, as shown in Fig. 5, we notice that the ocr confidence (of the model trained on iteration 1 data) remains high irrespective of the data samples from different users with varying character error rates. At the end of each iteration t ∈ [1, 3], we update the ocr model with cl on the m users' dataset obtained in the current iteration and proportionate data from the previous iterations [0, t-1] (along with the standard dataset, denoted as iteration 0); the number of samples drawn from iterations [0, t-1] is kept equal to the number of samples obtained in iteration t.
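To make the selection criterion and the data mixing concrete, the following is a minimal Python sketch. It assumes expert judgments are available per user as (prediction, ground truth, readable) tuples; all names and data structures here are illustrative assumptions, not the authors' code.

    import random

    def wrr_hr(results):
        # Word Recognition Rate restricted to Human-Readable (HR) samples.
        hr = [(pred, gt) for pred, gt, readable in results if readable]
        return 100.0 * sum(pred == gt for pred, gt in hr) / max(len(hr), 1)

    def select_users(val_results, m=4):
        # Pick the m new users whose human-readable words are recognized
        # worst, so their 23 training pages add the most to robustness.
        return sorted(val_results, key=lambda user: wrr_hr(val_results[user]))[:m]

    def build_curriculum_data(new_samples, old_pool):
        # Mix the selected users' samples with an equal number of samples
        # drawn from iterations [0, t-1] (iteration 0 = standard dataset).
        replay = random.sample(old_pool, min(len(new_samples), len(old_pool)))
        return new_samples + replay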

4. Experiments

We use the network architecture proposed by Gongidi et al. (Gongidi and Jawahar, 2021) for our experiments and refer to its model and dataset as belonging to iteration 0. The network consists of four modules: a Transformation Network (tn), a Feature Extractor (fe), Sequence Modeling (sm), and finally Predictive Modeling (pm). The Transformation Network has six plain convolutional layers with 16, 32, 64, 128, 128, and 128 channels. Each layer has a filter size, stride, and padding of 3, 1, and 1, respectively, followed by a 2×2 max-pooling layer with a stride of 2. The Feature Extractor module consists of a ResNet architecture. The Sequence Modeling module consists of a 2-layer Bidirectional lstm (blstm) with 256 hidden neurons in each layer. The Predictive Modeling module uses Connectionist Temporal Classification (ctc) to decode and recognize the characters by aligning the feature sequence and the target character sequence. We resize input images to 96×256. We use the Adadelta optimizer for Stochastic Gradient Descent (sgd) in all the experiments. We set the learning rate to 0.001, the batch size to 64, and the momentum to 0.09. For every curriculum fine-tuning, we reset the learning rate to 0.001. The code and trained model are available at https://github.com/ajoymondal/Towards-Robust-Handwritten-Text-Recognition-with-On-the-fly-User-Participation.
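For concreteness, below is a simplified, runnable PyTorch sketch of this pipeline: convolutional features, a 2-layer blstm with 256 hidden neurons per layer, and a ctc head. It is a sketch under stated assumptions, not the authors' implementation: the transformation network is omitted, and the small convolutional stack is an illustrative stand-in for the ResNet feature extractor; the class count is also assumed.

    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            # Illustrative stand-in for the ResNet feature extractor:
            # collapse image height, keep width as the time axis.
            self.features = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)),  # -> (B, 256, 1, W')
            )
            # Sequence modeling: 2-layer blstm, 256 hidden neurons per layer.
            self.blstm = nn.LSTM(256, 256, num_layers=2,
                                 bidirectional=True, batch_first=True)
            # Predictive modeling: per-time-step scores for ctc decoding
            # (num_classes includes the ctc blank symbol).
            self.head = nn.Linear(2 * 256, num_classes)

        def forward(self, x):                      # x: (B, 1, 96, 256)
            f = self.features(x).squeeze(2)        # (B, 256, W')
            s, _ = self.blstm(f.permute(0, 2, 1))  # (B, W', 512)
            return self.head(s)                    # (B, W', num_classes)

    model = CRNN(num_classes=111)  # assumed: ~110 Devanagari symbols + blank
    optimizer = torch.optim.Adadelta(model.parameters(), lr=0.001)
    logits = model(torch.randn(2, 1, 96, 256))
    log_probs = logits.log_softmax(-1).permute(1, 0, 2)  # (T, B, C) for CTCLoss
    ctc_loss = nn.CTCLoss(blank=0)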

We carry out the following experiments to study the effects of curriculum learning and user selection:

  • Iter1CL: We compare the Iteration 1 model trained i) with Curriculum Learning (cl) on the datasets from iterations [0, 1] and ii) without cl, i.e., by just fine-tuning the Iteration 0 model on the Iteration 1 training set. The former is referred to as Iter1CL and the latter as Iter1FT.

  • Iter2CLm: For Iteration 2, we compare the models trained on i) the training set from the m selected users (Iter2CLm) and ii) the same set after swapping the training data of one or two of the m selected users with the training data of the remaining (non-selected) users available in Iteration 2. We refer to these as Iter2CLmS1 and Iter2CLmS2. For Iter2CLmS1, the dataset from the best-performing user among the m selected users (we refer to the user whose validation set has the highest ocr quality as the best-performing user) is swapped with the data from the user with the overall highest ocr quality on the validation set. For Iter2CLmS2, the datasets from the two users with the highest ocr quality among the m selected users are swapped with the data from the two users with the overall highest ocr quality on the validation set.

  • Iter3CLm: Finally, we compare the model upgraded on the data collected in the third iteration, Iter3CLm, with Iter1CL and Iter2CLm on the overall test set, the common seen test set, and the common unseen test set, to demonstrate the generalization capability of the final model. (Seen/unseen throughout the paper means the model has/has not seen the writer's, or user's, distribution during training. A common seen/unseen test set is the test set from users whose data is seen/unseen in all the iterations.)

Table 1. Results showing the effect of Curriculum Learning on the test set of pages collected from 15 users.
S.No. Model CRR WRR
1 Iter0 (Gongidi and Jawahar, 2021) 69.35 31.61
2 Iter1FT 88.34 70.22
3 Iter1CL 88.43 70.58
Figure 6. Qualitative results on unseen writing styles of users with IDs 8, 9, and 15. Rows 1–3: most of the words are predicted correctly in the second iteration. Row 4: some rare words are correctly predicted in the third iteration. Row 5: a failure case.

5. Results

Effect of Curriculum Learning: As shown in the first two rows of Table 1, just fine-tuning the model of Gongidi et al. (Gongidi and Jawahar, 2021) on the Iteration 1 dataset improves the Character Recognition Rate (CRR) and Word Recognition Rate (WRR) significantly, by 19% and 38%, on the test set. However, Curriculum Learning (cl) on the samples collected from the first four users and an equal number of word samples from the large training set of Gongidi et al. (Gongidi and Jawahar, 2021) helps us with further improvements, as shown in the third row of Table 1. As we will see in the next paragraphs, cl also helps us improve results in subsequent iterations.
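For reference, the CRR and WRR values in the tables can be computed as sketched below, assuming the conventional definitions CRR = 100·(1 − character edit distance / number of reference characters) and WRR = 100·(exactly matched words / total words); these formulas are the standard ones, not quoted from the paper.

    def edit_distance(a, b):
        # Classic dynamic-programming Levenshtein distance over characters.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[-1]

    def crr_wrr(predictions, references):
        errs = sum(edit_distance(p, r) for p, r in zip(predictions, references))
        chars = sum(len(r) for r in references)
        correct = sum(p == r for p, r in zip(predictions, references))
        return 100 * (1 - errs / chars), 100 * correct / len(references)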

Table 2. User selection for iteration 2 based on Human-Readable (HR) words from val. set. Selected users in bold.
S.No. User ID Model WRR WRR-HR
1 5 63.11 73.03
2 6 51.06 66.67
3 7 Iter1CL 65.12 81.25
4 8 84.61 89.19
5 9 80.00 84.44
6 10 43.39 55.00

Effect of User Selection: The results of the Iter1CL model on the validation sets of Iteration 2 from users with IDs [5, 10] are shown in Table 2. Firstly, as shown in the last two columns of the table, the WRR on Human-Readable words (WRR-HR) is generally higher than the WRR on all the words (WRR) in the validation set. However, the sorting order of the recognition rates may change if we consider all words against only Human-Readable (HR) words. Selecting users based on the worst overall WRR can have an adverse effect on training the model, since we cannot expect the model to perform well on non-readable samples with spelling mistakes and genuine n-gram confusion errors that occur due to awful handwriting (refer to Fig. 4). Based on WRR-HR, users with IDs {5, 6, 7, 10} are selected for training the Iter2CLm model.

Table 3. Results showing the effect of User Selection on the test set of pages collected from 15 users.
S.No. Model CRR WRR
1 Iter2CLmS2 89.34 73.12
2 Iter2CLmS1 90.49 75.85
3 Iter2CLm 91.25 78.43

The results of the model trained on the selected users (with IDs {5, 6, 7, 10}) are shown in the third row of Table 3. As shown, the user selection based on WRR-HR performs best compared to the other two variants (rows 1–2 of Table 3). Iter2CLmS1 involves swapping the best-performing selected user's (i.e., user 7's) training set with the overall best-performing user's (i.e., user 8's) training set. Iter2CLmS1 achieves slightly degraded performance, with a drop of around 1% in CRR and 2% in WRR compared to Iter2CLm, as shown in the last two rows of Table 3. As the first row of the table depicts, the performance drops further in similar proportions if we swap the training data of users {5, 7} with the training data of users {8, 9}.

Table 4. User selection for iteration 3 based on Human-Readable (HR) words from val. set. Selected users in bold.
S.No. User ID Model WRR WRR-HR
1 8 84.61 89.19
2 9 90.00 93.62
3 11 77.36 79.59
4 12 Iter2CLm 68.52 82.22
5 13 84.78 88.64
6 14 83.72 85.36
7 15 82.50 93.94

Results on different Iterations: Since the users with IDs {8, 9} were not selected in the second iteration, we reconsider them in the third iteration. Interestingly, as shown in Table 4, the performance of the Iter2CLm model on the (unseen) user 8 validation set is retained compared to the Iter1CL model (row 4 of Table 2 vs. row 1 of Table 4), and the performance on the (unseen) user 9 data improves significantly by 9% (compare row 5 of Table 2 with row 2 of Table 4). Another interesting highlight of Table 4 is that the WRR-HR of the Iter2CLm model on all the unseen validation sets from users with IDs {8, 9, 11, 12, 13, 14, 15} is close to or above 80%. This shows that Iter2CLm has generalized well. Based on WRR-HR, we select users with IDs {11, 12, 13, 14} for training the model in iteration 3.

Table 5. Results of models trained at different iterations on the overall test set, the common seen test set (from users {1, 2, 3, 4}), and the common unseen test set (from users {8, 9, 15}).
Model All Test Set Seen Test Set Unseen Test Set
CRR WRR CRR WRR CRR WRR
Iter1CL 88.43 70.58 94.26 80.44 92.53 79.33
Iter2CLm 91.25 78.43 94.29 83.62 95.04 85.28
Iter3CLm 92.32 79.04 94.89 84.29 95.76 85.68

The first three rows of Table 5 reestablish that Iter2CLm has generalized well over the overall test set. A similar trend is observed on the seen test set from the users with IDs {1, 2, 3, 4}, who are the common seen users across the three iterations, and on the unseen test set from users with IDs {8, 9, 15}, who are the common unseen users across the three iterations. The CRR and WRR improve significantly, by around 2–3% and 6–8%, compared to the iteration 1 model on the overall test set and the common unseen test set, as shown in rows 1–2, columns 2, 3, 6, and 7 of Table 5. However, as shown in rows 1–2 and columns 4–5 of the table, the performance gains are slightly smaller (around 0% in CRR and 3% in WRR) on the common seen test set. The slight performance gains of Iter3CLm over Iter2CLm, by < 1% on all three types of test sets as shown in the last row of Table 5, indicate that we have come close to saturation performance on the test set of users 1–15. The first three rows of Fig. 6, showing qualitative results on the unseen dataset from users {8, 9, 15}, also show that the performance of Iter2CLm and Iter3CLm is similar. One of the rare words corrected in the third iteration and a failure case are shown in the last two rows of Fig. 6. For a rough comparison, one can also consider the CRR and WRR of 93.98 and 75.41 reported by Gongidi et al. (Gongidi and Jawahar, 2021), wherein the model has seen the writing styles of all the users in the test set. Overall, one may argue that the writing style of the common unseen users in our case is similar to that of the seen users, and that this is the reason the accuracy on the unseen test set is high. However, we reason that the high recognition rates on the unseen test set are due to the user selection process described in the previous sections. The selection process ensures that users with high validation accuracies, who are less likely to improve the model's generic performance, do not contribute more to the training data in subsequent iterations and are thus filtered out as common unseen users.

6. Conclusion

We proposed a controlled framework for upgrading a handwritten ocr model on the dataset provided by 15 users. We upgraded the model three times, with a fixed budget of 23 pages each from four different users in each iteration. Our work lays the foundation for long-term ocr services in crowd-sourcing scenarios. We believe our experiments have shown the effectiveness of Curriculum Learning (cl) in the regular upgradation of the model and of user selection under fixed budgets. Finally, we provide a way to quantify the robustness of ocr models using data with writing styles from unseen users. In the future, we would like to explore our work in a real crowd-sourcing scenario in multiple Indian languages, along with numerous factors and a model for user selection.

References

  • cit (2020) Script Grammar for Indian Languages. http://language.worldofcomputing.net/grammar/script-grammar.html. Accessed March 26, 2020.
  • Adak et al. (2016) Chandranath Adak, Bidyut B Chaudhuri, and Michael Blumenstein. 2016. Offline cursive Bengali word recognition using CNNs with a recurrent model. In ICFHR. 429–434.
  • Alonso-Weber et al. (2014) Juan Manuel Alonso-Weber, MP Sesmero, and Araceli Sanchis. 2014. Combining additive input noise annealing and pattern transformations for improved handwritten character recognition. Expert systems with applications 41, 18 (2014), 8180–8188.
  • Arora et al. (2010) Sandhya Arora, Debotosh Bhattacharjee, Mita Nasipuri, Latesh Malik, Mohantapash Kundu, and Dipak Kumar Basu. 2010. Performance comparison of SVM and ANN for handwritten devnagari character recognition. arXiv (2010).
  • Bluche (2016) Théodore Bluche. 2016. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. Advances in neural information processing systems 29 (2016).
  • Bluche and Messina (2017) Théodore Bluche and Ronaldo Messina. 2017. Gated convolutional recurrent neural networks for multilingual handwriting recognition. In ICDAR, Vol. 1. 646–651.
  • Chaudhary and Bali (2021) Kartik Chaudhary and Raghav Bali. 2021. EASTER: Simplifying Text Recognition using only 1D Convolutions.. In Canadian Conference on AI.
  • Chowdhury and Vig (2018) Arindam Chowdhury and Lovekesh Vig. 2018. An efficient end-to-end neural model for handwritten text recognition. arXiv (2018).
  • Coquenet et al. (2020) Denis Coquenet, Clément Chatelain, and Thierry Paquet. 2020. Recurrence-free unconstrained handwritten text recognition using gated fully convolutional network. In ICFHR. IEEE, 19–24.
  • Diaz et al. (2021) Daniel Hernandez Diaz, Siyang Qin, Reeve Ingle, Yasuhisa Fujii, and Alessandro Bissacco. 2021. Rethinking text line recognition models. arXiv (2021).
  • Dutta et al. (2017) Kartik Dutta, Praveen Krishnan, Minesh Mathew, and CV Jawahar. 2017. Towards accurate handwritten word recognition for Hindi and Bangla. In NCVPRIPG. 470–480.
  • Dutta et al. (2018) Kartik Dutta, Praveen Krishnan, Minesh Mathew, and C V Jawahar. 2018. Towards spotting and recognition of handwritten words in indic scripts. In ICFHR. 32–37.
  • Garain et al. (2015) Utpal Garain, Luc Mioulet, Bidyut B Chaudhuri, Clement Chatelain, and Thierry Paquet. 2015. Unconstrained Bengali handwriting recognition with recurrent models. In ICDAR.
  • Gongidi and Jawahar (2021) Santhoshini Gongidi and CV Jawahar. 2021. iiit-indic-hw-words: A Dataset for Indic Handwritten Text Recognition. In International Conference on Document Analysis and Recognition. Springer, 444–459.
  • Google (2020) Google. 2020. Google’s Optical Character Recognition (OCR) Software Works for 248+ Languages. https://opensource.com/life/15/9/open-source-extract-text-images. Last accessed on 13 August.
  • Goyal et al. (2022) Aman Goyal, Dev Agarwal, Anbumani Subramanian, CV Jawahar, Ravi Kiran Sarvadevabhatla, and Rohit Saluja. 2022. Detecting, Tracking and Counting Motorcycle Rider Traffic Violations on Unconstrained Roads. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4303–4312.
  • Graves and Schmidhuber (2008) Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS.
  • Hacohen and Weinshall (2019) Guy Hacohen and Daphna Weinshall. 2019. On The Power of Curriculum Learning in Training Deep Networks. In International Conference on Machine Learning. PMLR, 2535–2544.
  • Ingle et al. (2019) R Reeve Ingle, Yasuhisa Fujii, Thomas Deselaers, Jonathan Baccash, and Ashok C Popat. 2019. A scalable handwritten text recognition system. In ICDAR. IEEE, 17–24.
  • Jemni et al. (2022) Sana Khamekhem Jemni, Sourour Ammar, and Yousri Kessentini. 2022. Domain and writer adaptation of offline Arabic handwriting recognition using deep neural networks. Neural Computing and Applications (2022).
  • Kang et al. (2022) Lei Kang, Pau Riba, Marçal Rusiñol, Alicia Fornés, and Mauricio Villegas. 2022. Pay attention to what you read: non-recurrent handwritten text-line recognition. Pattern Recognition 129 (2022), 108766–108778.
  • Kang et al. (2018) Lei Kang, J Ignacio Toledo, Pau Riba, Mauricio Villegas, Alicia Fornés, and Marçal Rusinol. 2018. Convolve, attend and spell: An attention-based sequence-to-sequence model for handwritten word recognition. In German Conference on Pattern Recognition. 459–472.
  • Kaur and Kumar (2021) Harmandeep Kaur and Munish Kumar. 2021. On the recognition of offline handwritten word using holistic approach and AdaBoost methodology. Multimedia Tools and Applications (2021).
  • Krishnan and Jawahar (2019) Praveen Krishnan and C V Jawahar. 2019. Hwnet v2: An efficient word image representation for handwritten documents. IJDAR (2019).
  • Labani et al. (2018) Mahdieh Labani, Parham Moradi, Fardin Ahmadizar, and Mahdi Jalili. 2018. A novel multivariate filter method for feature selection in text classification problems. Engineering Applications of Artificial Intelligence 70 (2018), 25–37.
  • Li et al. (2021) Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. 2021. Trocr: Transformer-based optical character recognition with pre-trained models. arXiv (2021).
  • Ly et al. (2018) Nam Tuan Ly, Cuong Tuan Nguyen, and Masaki Nakagawa. 2018. Training an end-to-end model for offline handwritten Japanese text recognition by generated synthetic patterns. In ICFHR.
  • Maalej and Kherallah (2020) Rania Maalej and Monji Kherallah. 2020. Improving the DBLSTM for on-line Arabic handwriting recognition. Multimedia Tools and Applications (2020).
  • Michael et al. (2019) Johannes Michael, Roger Labahn, Tobias Grüning, and Jochen Zöllner. 2019. Evaluating sequence-to-sequence models for handwritten text recognition. In ICDAR. IEEE, 1286–1293.
  • Mousavi et al. (2021) Hamidreza Mousavi, Maryam Imani, and Hassan Ghassemian. 2021. Deep Curriculum Learning for PolSAR Image Classification. arXiv preprint arXiv:2112.13426 (2021).
  • Nguyen et al. (2020) Kha Cong Nguyen, Cuong Tuan Nguyen, and Masaki Nakagawa. 2020. A Semantic Segmentation-based Method for Handwritten Japanese Text Recognition. In ICFHR.
  • Pal and Chaudhuri (2004) Umapada Pal and BB Chaudhuri. 2004. Indian script character recognition: a survey. Pattern Recognition (2004).
  • Peng et al. (2022) Dezhi Peng, Lianwen Jin, Weihong Ma, Canyu Xie, Hesuo Zhang, Shenggao Zhu, and Jing Li. 2022. Recognition of Handwritten Chinese Text by Segmentation: A Segment-annotation-free Approach. IEEE Transactions on Multimedia (2022).
  • Pham et al. (2014) Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2014. Dropout improves recurrent neural networks for handwriting recognition. In ICFHR.
  • Ptucha et al. (2019) Raymond Ptucha, Felipe Petroski Such, Suhas Pillai, Frank Brockler, Vatsala Singh, and Paul Hutkowski. 2019. Intelligent character recognition using fully convolutional neural networks. Pattern Recognition 88 (2019), 604–613.
  • Roy et al. (2016) Partha Pratim Roy, Ayan Kumar Bhunia, Ayan Das, Prasenjit Dey, and Umapada Pal. 2016. HMM-based Indic handwritten word recognition using zone segmentation. Pattern Recognition 60 (2016), 1057–1075.
  • Shaw et al. (2014) Bikash Shaw, Ujjwal Bhattacharya, and Swapan K Parui. 2014. Combination of features for efficient recognition of offline handwritten Devanagari words. In ICFHR. 240–245.
  • Shaw et al. (2008) Bikash Shaw, Swapan Kumar Parui, and Malayappan Shridhar. 2008. Offline Handwritten Devanagari Word Recognition: A holistic approach based on directional chain code feature and HMM. In ICIT. 203–208.
  • Singh et al. ([n.d.]) Deepak Kumar Singh, Shyam Nandan Rai, KJ Joseph, Rohit Saluja, Vineeth N Balasubramanian, Chetan Arora, Anbumani Subramanian, and CV Jawahar. [n.d.]. ORDER: Open World Object Detection on Road Scenes. ([n. d.]).
  • Sueiras et al. (2018) Jorge Sueiras, Victoria Ruiz, Angel Sanchez, and Jose F Velez. 2018. Offline continuous handwriting recognition using sequence to sequence neural networks. Neurocomputing 289 (2018), 119–128.
  • Tafti et al. (2016) Ahmad P Tafti, Ahmadreza Baghaie, Mehdi Assefi, Hamid R Arabnia, Zeyun Yu, and Peggy Peissig. 2016. OCR as a service: an experimental evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. In International Symposium on Visual Computing. Springer, 735–746.
  • Tesseract (2022) Tesseract. 2022. Tesseract Open Source OCR. https://github.com/tesseract-ocr/. Last accessed on 13 August.
  • Wang et al. (2018) Jiasi Wang, Xinggang Wang, and Wenyu Liu. 2018. Weakly- and Semi-supervised Faster R-CNN with Curriculum Learning. In 24th International Conference on Pattern Recognition (ICPR). IEEE, 2416–2421.
  • Wang et al. (2019) Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, and Junjie Yan. 2019. Dynamic Curriculum Learning for Imbalanced Data Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5017–5026.
  • Wigington et al. (2017) Curtis Wigington, Seth Stewart, Brian Davis, Bill Barrett, Brian Price, and Scott Cohen. 2017. Data augmentation for recognition of handwritten words and lines using a CNN-LSTM network. In ICDAR. 639–645.
  • Wu et al. (2017) Yi-Chao Wu, Fei Yin, Zhuo Chen, and Cheng-Lin Liu. 2017. Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network. In ICDAR.
  • Xie et al. (2016) Zecheng Xie, Zenghui Sun, Lianwen Jin, Ziyong Feng, and Shuye Zhang. 2016. Fully convolutional recurrent network for handwritten chinese text recognition. In ICPR.
  • Yousef et al. (2020) Mohamed Yousef, Khaled F Hussain, and Usama S Mohammed. 2020. Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. Pattern Recognition 108 (2020), 107482.