[1]\fnmRobert \surTurnbull \orcid0000-0003-1274-6750
[1]\orgdivMelbourne Data Analytics Platform, \orgnameThe University of Melbourne, \orgaddress\streetGrattan St., \cityParkville, \postcode3010, \stateVIC, \countryAustralia
[2]\orgdivMelbourne Centre for Data Science, \orgnameThe University of Melbourne, \orgaddress\streetGrattan St., \cityParkville, \postcode3010, \stateVIC, \countryAustralia
Detecting and recognizing characters in Greek papyri with YOLOv8, DeiT and SimCLR
Abstract
Purpose: The capacity to isolate and recognize individual characters from facsimile images of papyrus manuscripts yields rich opportunities for digital analysis. For this reason the ‘ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri’ was held as part of the 17th International Conference on Document Analysis and Recognition. This paper discusses our submission to the competition.
Methods: We used an ensemble of YOLOv8 models to detect and classify individual characters, and employed two approaches to refine the character predictions: a transformer-based DeiT approach and a ResNet-50 model trained on a large corpus of unlabelled data using SimCLR, a self-supervised learning method.
Results: Our submission won the recognition challenge with a mean average precision (mAP) of 42.2% and was runner-up in the detection challenge with a mAP of 51.4%. At the more relaxed intersection over union threshold of 0.5, we achieved the highest mean average precision and mean average recall results for both detection and recognition.
Conclusion: The results demonstrate the potential for these techniques for automated character recognition on historical manuscripts. We ran the prediction pipeline on more than 4,500 images from the Oxyrhynchus Papyri to illustrate the utility of our approach, and we release the results publicly in multiple formats.
keywords:
character, detection, recognition, greek, papyri, oxyrhynchus
MSC Classification: 68T10, 62M45
1 Introduction
The challenge of publishing editions of manuscripts from the ancient world is immense. There are countless documents in libraries, museums and monasteries that have yet to be edited or published in a digital form. For example, the Oxyrhynchus Papyri collection, discovered by Bernard P. Grenfell and Arthur S. Hunt, contains approximately half a million papyri and papyrus fragments [14], only a small fraction of which have been published.
Machine learning can assist in the tasks of editing and analyzing these collections. The ability to detect and recognize individual characters is especially useful and can be used for tasks such as assisting with approximate dating and identifying disiecta membra written by the same scribe [6]. Automating the transcription of these documents also allows large corpora of scanned documents to be quickly searched for keywords or phrases of interest [1].
To compare and evaluate methods for this task, Mathias Seuret and colleagues ran the ‘ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri’ as part of the 17th International Conference on Document Analysis and Recognition [19]. The competition evaluated two tasks, one for detecting the bounding box for individual characters and the other for correctly recognizing the letter within each bounding box. This paper outlines our submission and the results of applying our method to published images of the Oxyrhynchus Papyri.
2 Data
2.1 Competition Dataset
The training dataset provided for the competition consisted of 153 images from 108 Greek papyrus manuscripts which preserve text from the Iliad of Homer. Bounding boxes were added and annotated with the Greek letter for each character present. Apostrophes and periods were also annotated but were ignored in evaluating the results. We divided the training dataset into five cross-validation partitions, with multiple images from the same source manuscript assigned to the same partition. The test dataset included 34 images from 31 manuscripts. Each submission required a JSON file in the COCO format to be uploaded to the competition site on CodaLab [15].
2.2 Supplemental Data
To supplement the images in the training dataset, we included 4,533 images from volumes XV–LXXXII of the Oxyrhynchus Papyri (excluding volume XXVII). These volumes include literary texts, as in the competition dataset, as well as documentary texts. To help train models specifically for recognition, we used the second version of the AL-PUB dataset, which contains 205,797 cropped images of individual characters taken from images in the Oxyrhynchus Papyri collection as part of the ‘Ancient Lives Project’ [21]. Each of these images is labeled with the Greek letter it contains.
3 Method
We first trained models to perform both detection and recognition of characters and then models specifically for the recognition task.
3.1 Detection and preliminary recognition
For character detection and preliminary recognition, we used YOLO (You Only Look Once) [18]. The YOLO neural network uses a series of convolutional layers to predict bounding boxes and class probabilities simultaneously. Initial detections are made on coarse cells produced from the convolutions and, if the same object is predicted in multiple cells, the duplicates are removed using non-maximum suppression. The release of YOLO9000 [16] and YOLOv3 [17] brought multiple enhancements. In 2020, Glenn Jocher released a PyTorch implementation of YOLOv3 which he named YOLOv5, published by Ultralytics [7]. Our models used Jocher’s later version, YOLOv8 [8]. We used the ‘x’ sized pretrained model, which is reported to achieve a mAP of 68.2 on the COCO val2017 dataset. We trained the model on the training dataset at three image resolutions: small (1280×1280), medium (1600×1600) and large (2048×2048). Each YOLOv8 model was trained for 200 epochs and the weights from the epoch with the best result on the validation set were saved. The trained models were used to make predictions on the unannotated Oxyrhynchus Papyri images, which were used as pseudo-labels [10] and concatenated with the competition training set. The YOLOv8 models were then retrained on the combined input from the competition dataset and the pseudo-labels.
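As a rough illustration of this pipeline, the sketch below shows how a pretrained YOLOv8-x model can be fine-tuned with the Ultralytics API and then used to generate pseudo-labels on unannotated images. The dataset configuration file, directory names and confidence threshold are placeholders, not the exact settings used for our submission.

```python
from ultralytics import YOLO

# Fine-tune the pretrained 'x' sized YOLOv8 checkpoint on the labelled
# competition data (the dataset YAML and paths here are illustrative).
model = YOLO("yolov8x.pt")
model.train(data="papyri.yaml", epochs=200, imgsz=1280)

# Run the trained detector over unannotated Oxyrhynchus images and save the
# predictions in YOLO text format so they can serve as pseudo-labels [10].
model.predict(source="oxyrhynchus/images", conf=0.25, save_txt=True, save_conf=True)
```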
3.2 Recognition specific models
To enhance the recognition performance of the YOLOv8 results, we trained additional specialized recognition models. This allowed us to incorporate additional labeled and unlabeled data, and to leverage self-supervised learning approaches.
In the first instance, we trained a self-supervised SimCLR model [2] on all of the available data containing cropped Greek characters. This included the competition dataset, the AL-PUB dataset, and the characters detected by the YOLOv8 models from the Oxyrhynchus Papyri.
The SimCLR approach takes a mini-batch of images as usual, but augments each image to produce two views and employs a contrastive loss to teach the network to match each augmented view in the batch with its positive pair. On datasets such as ImageNet and CIFAR-10, augmentations including cropping, horizontal flipping, color-jitter and Gaussian blur are used to get best results [2].
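For reference, a minimal sketch of the contrastive (NT-Xent) loss at the heart of SimCLR is given below; it is a generic PyTorch implementation following [2], not our training code, and the temperature value is an arbitrary example.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss: match each augmented view with its positive pair.

    z1, z2 are the [N, D] projections of the two augmented views of the same N images.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, D] unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # The positive pair of row i is row i+N (first half) or row i-N (second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example with random projections for a mini-batch of 8 images.
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```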
Selecting augmentations appropriate to the problem is important for models trained using SimCLR to perform well, because the augmentations determine which image features the model uses to distinguish images during training [4]. In the context of Greek character recognition, augmentations such as aggressive cropping and horizontal flipping may change the character enough to turn it into a different class. Accordingly, we reduced the degree of cropping, removed horizontal flips, and applied stronger color jitter and blur.
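A possible torchvision pipeline in this spirit is sketched below; the crop scale, jitter strengths, blur parameters and image size are illustrative assumptions rather than the exact values we used.

```python
import torchvision.transforms as T

# SimCLR-style augmentations adapted for cropped Greek characters:
# gentler cropping, no horizontal flip, stronger colour jitter and blur.
char_augment = T.Compose([
    T.RandomResizedCrop(64, scale=(0.7, 1.0)),                   # milder than the default scale=(0.08, 1.0)
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.9),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
    T.ToTensor(),
])
```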
After pretraining a ResNet-50 convolutional neural network on the data, we fine-tuned the model [3] using the labels available in the competition and AL-PUB datasets using fivefold cross-validation.
In the second instance, we trained DeiT transformer models using transfer learning [22]. This was an entirely supervised process so we used the same labeled data as for the ResNet-50 SimCLR models with the same cross-validation splits. For this approach we also used less aggressive cropping during training, so that a significant portion of the character of interest would always be present during the training process.
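The sketch below shows one way such a DeiT-small model could be fine-tuned with the timm library; the number of classes and the optimizer settings are placeholders rather than our exact configuration.

```python
import timm
import torch

# Load a DeiT-small backbone pretrained on ImageNet and replace its head
# with a classifier over the Greek character classes (class count assumed).
model = timm.create_model("deit_small_patch16_224", pretrained=True, num_classes=25)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised fine-tuning step on a mini-batch of cropped characters."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```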
3.3 Ensembling
For our final submission, we ensembled the results from our trained models. For character detection, we ensembled the results of the trained YOLOv8 models using Weighted Boxes Fusion [20]. We then cropped the characters resulting from this procedure and used our recognition models to make predictions for each. We first ensembled the predictions of the ResNet-50 SimCLR and DeiT-small models separately, using a hard majority voting approach. We then ensembled the YOLOv8 classification with the overall ResNet-50 SimCLR and DeiT-small votes by adjusting each bounding box’s confidence according to the proportion of votes each method gave to a particular character.
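For the detection step, the ensemble_boxes package that accompanies [20] provides a weighted_boxes_fusion function; the toy example below shows how detections from two models for a single image could be fused (coordinates are normalised to [0, 1], and the thresholds are illustrative).

```python
from ensemble_boxes import weighted_boxes_fusion

# Each list entry holds the predictions of one model for the same image.
boxes_list = [
    [[0.10, 0.20, 0.15, 0.28]],   # model 1: one box as (x1, y1, x2, y2), normalised
    [[0.11, 0.21, 0.16, 0.29]],   # model 2
]
scores_list = [[0.90], [0.85]]
labels_list = [[3], [3]]          # predicted character class indices

fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list, iou_thr=0.55, skip_box_thr=0.0
)
```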

4 Evaluation
The competition used the mean average precision (mAP) evaluation metric as defined for the COCO dataset, which averages results over ten intersection over union (IoU) thresholds in increments of 0.05 from 0.5 to 0.95. This was the primary metric that determined the winner for both the detection and recognition tasks, and the results reported below use it unless stated otherwise. We also report mean average precision and mean average recall (mAR) scores at an IoU threshold of 0.5, the traditional value for evaluating object detection used in the PASCAL Visual Object Classes challenge [5], and at an IoU of 0.75, which requires stricter localization of the bounding box.
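This evaluation can be reproduced with the pycocotools reference implementation, as sketched below for COCO-format ground truth and prediction files (file names are placeholders).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Standard COCO-style bounding-box evaluation: mAP averaged over
# IoU thresholds 0.50:0.05:0.95, plus AP at IoU=0.5 and IoU=0.75.
coco_gt = COCO("ground_truth.json")
coco_dt = coco_gt.loadRes("predictions.json")
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[0.50:0.95], AP@0.50, AP@0.75 and AR metrics
```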
5 Results
5.1 Experimental Results
Fig. 1a shows the mean average precision results of cross-validation on the training dataset. Increasing the resolution yielded a small improvement for both detection and recognition, but adding pseudo-labels (fig. 1b) from the model’s predictions on the Oxyrhynchus Papyri images resulted in a 6–8% increase for detection and an 8–11% increase for recognition.
Fig. 1c shows results on the test set for YOLOv8 models trained on the five cross-validation partitions and at the three image resolutions. We achieved better results with higher resolutions but with diminishing returns. The average mAP score for detection across the five cross-validation partitions at the small resolution was 48.7%. This increased to 49.5% at the medium resolution but a further increase to the large resolution only raised the average mAP to 49.6%. In terms of letter recognition, the lower resolution gave a mAP of 35.8% while the medium and large resolutions gave 37.3% and 37.6% respectively. Ensembling the five models resulted in a distinct improvement for both detection and recognition. The final ensemble combining the YOLOv8 models at all resolutions produced the best results with a mAP score of 51.8% for detection and 41.4% for recognition.

The results of the models trained specifically for recognition are shown in fig. 2. The YOLOv8 ensemble outperformed both the DeiT and the SimCLR models. But when they were ensembled, we achieved the highest recognition mAP score of 42.16%. This was our ultimate submission for the competition. The ensembling process adds new bounding boxes and modifies their confidence scores to account for each of the characters predicted to be present, which results in the slightly lower detection mAP of 51.42% for a larger gain in the recognition mAP.
We found that the SimCLR ResNet-50 and DeiT-Small models performed similarly in identifying the Greek character present in an image. Averaging the cross-validation results, we obtained 88.4% accuracy for the former and 87.4% accuracy for the latter (fig. 2a). The combination of these models achieved 89.1% accuracy.
The supplementary material contains a confusion matrix for our ensembled predictions, which shows that we obtained reasonable recall for every character, even those poorly represented in the dataset. The most common mistakes are between characters having a similar appearance, such as A and or Z and .
5.2 Competition Results

The results of our submission in the detection task are shown in fig. 3a. In the primary metric (and at an IoU of 0.75) our submission was behind that of Vu & Aimar, giving us second place for this task. At the lower IoU value of 0.5, we achieved a much higher mAP of 93.2%, the highest of all submissions. This indicates that while our bounding boxes were not as precisely localized as those of the winning submission, our method was still able to accurately identify the vast majority of letters. In common with the other submissions, the recall scores were substantially better than the precision scores. We achieved an average recall of 98.6% at an IoU of 0.5 and 81.2% at an IoU of 0.75, both the best of all submissions.
The results of our submission in the recognition task are shown in fig. 3b. In the primary metric we achieved a mAP of 42.2%, the winning result for this task. As with detection, the result at an IoU of 0.75 was not as good as the submission by Vu & Aimar, but at an IoU of 0.5 our result of 74.7% was more than 2% higher than the next closest submission. Our submission also ranked highest on average recall at IoU values of both 0.5 and 0.75.
6 Oxyrhynchus Papyri Predictions

To illustrate the potential of our approach to reduce the time required to transcribe manuscripts into a digital format, we applied our model to 4,533 documents from the Oxyrhynchus Papyri collection. Once we obtained the character bounding boxes and predictions for the scanned images of these papyri (as detailed in §3.3), we removed boxes with low confidence values and, where there was a large degree of overlap between boxes, retained the box with the highest confidence and discarded the rest.
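A simple sketch of this filtering step is given below; the confidence and overlap thresholds are illustrative assumptions rather than the values used for the released predictions.

```python
def filter_boxes(boxes, conf_threshold=0.3, iou_threshold=0.8):
    """Keep confident boxes; where boxes overlap heavily, keep only the most confident.

    boxes: iterable of (x1, y1, x2, y2, confidence) tuples.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union else 0.0

    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):   # most confident first
        if box[4] >= conf_threshold and all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept
```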
We then used feathering of the heights and widths of the boxes to identify lines of text and paragraphs (a sketch of this grouping step appears below), allowing us to convert the manuscripts to text. The outputs from this process are shown for a particular manuscript in fig. 4. While more advanced Optical Character Recognition (OCR) approaches might be used for these final steps, this simple approach already allows us to conduct keyword searches, highlighting the value of these models. For example, a search for the stem of the name Achilles (AXI*) finds 18 documents, four of which contain the text of the Iliad (2748, 3155, 3323, 4817); in 12 of the remainder (2568, 2672, 2680, 2786, 2960, 2967, 3479, 3774, 4089, 4849, 4991, 5010), the text corresponds to the transcription in the Digital Corpus of Literary Papyri (DCLP) or the Duke Databank of Documentary Papyri (DDbDP) hosted at https://papyri.info. Only two hits, both documentary papyri (2182 and 3250), were false positives where the DDbDP transcription did not include the query string. Since the original training data came from literary papyri, the results for documentary texts in the Oxyrhynchus Papyri with a cursive script are less reliable than those with a bookhand.
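One way such feathering-based line grouping could work is sketched here: each box’s vertical extent is expanded (‘feathered’) by a fraction of its height, and boxes whose expanded extents overlap are read as one line. The feathering factor and data layout are assumptions for illustration, not our exact procedure.

```python
def group_into_lines(boxes, feather=0.5):
    """Group character boxes (x1, y1, x2, y2) into text lines, read left to right."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        height = box[3] - box[1]
        top, bottom = box[1] - feather * height, box[3] + feather * height
        for line in lines:
            if top <= line["bottom"] and bottom >= line["top"]:   # feathered extents overlap
                line["boxes"].append(box)
                line["top"] = min(line["top"], box[1])
                line["bottom"] = max(line["bottom"], box[3])
                break
        else:
            lines.append({"top": box[1], "bottom": box[3], "boxes": [box]})
    return [sorted(line["boxes"], key=lambda b: b[0]) for line in lines]
```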
We publicly release the bounding box data using the COCO annotation format, plain text and Text Encoding Initiative (TEI) XML at https://doi.org/10.6084/m9.figshare.25011140.v1.
7 Discussion
Since 1898, only a small fraction of the Oxyrhynchus Papyri has been catalogued and transcribed. This work, along with the character recognition models presented by Swindall et al. [21] and the other submissions to the ICDAR competition [19], demonstrates the potential of new machine learning and AI approaches to make large-scale analysis of these documents feasible.
The ability to reliably detect bounding boxes for characters in Greek manuscripts, and to recognize the letters within them, also opens many avenues for quantitative analysis. For example, similarities between the characters from different fragments might be used to discover disiecta membra of the same manuscript (such as P.Oxy XVIII.2170 and PSI XI.1218 [11, 54]) or to identify separate manuscripts written by the same scribe (for a list of such instances, see [9], pp. 61–65).
Similarly, the isolated characters could be used to automatically detect the style of the hand. The isolated characters could be used as input to a machine learning model to predict the approximate date of the manuscript (for a survey of automated techniques for manuscript dating see [13]). The localization of the characters on the page can automate the process of identifying the widths and heights of the columns, similar to the analysis of [9], pp. 162–212. The widths and heights of individual characters could also be used to refine probabilistic models for assessing the content that could have been in manuscript lacunae, such as that of [12].
We have presented the results of our submission to the ‘ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri’. We used an ensemble of YOLOv8 models to detect and recognize individual characters and refined the recognition predictions with DeiT and SimCLR models. Our predictions were enhanced by supplementary data from the Oxyrhynchus Papyri. Our submission came second in the detection challenge with a mAP of 51.42% for IoU values from 0.5 to 0.95, behind the winning score of 51.83%. At an IoU value of 0.5, our detection results of 93.2% mAP and 98.6% mAR were the highest of all submissions. For the recognition challenge, our submission produced a mAP of 42.2% for IoU values from 0.5 to 0.95 and was the winning submission; at the less strict IoU value of 0.5 we achieved the best results with a mAP of 74.7% and a mAR of 69.9%. We release our predictions on the Oxyrhynchus Papyri in COCO annotation format, plain text, and TEI XML, anticipating that others might use these results for further analysis.
Supplementary information
The confusion matrix from the cross-validation results of the recognition-specific models discussed in §5.1 is provided in the supplementary material.
Acknowledgements
We thank David Turnbull, Kamalpreet Singh and Daniel Russo-Batterham for assistance with different aspects of this project.
Statements and Declarations
This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative. This project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. Evelyn Mannix’s contribution to this research was supported by an Australian Government Research Training Program (RTP) Scholarship.
Financial interests The authors declare they have no financial interests.
Non-financial interests None.
Data availability The training and validation data for the competition are available at https://codalab.lisn.upsaclay.fr/competitions/12419. Oxyrhynchus Papyri images are available at https://oxyrhynchus.web.ox.ac.uk/images. The AL-PUB version 2 dataset is available at https://data.cs.mtsu.edu/al-pub/. Our detections of characters from the Oxyrhynchus Papyri are available at https://doi.org/10.6084/m9.figshare.25011140.v1.
References
- Alabau et al [2014] Alabau V, Martínez-Hinarejos CD, Romero V, et al (2014) An iterative multimodal framework for the transcription of handwritten historical documents. Pattern Recognition Letters 35:195–203. https://doi.org/10.1016/j.patrec.2012.11.007, URL https://www.sciencedirect.com/science/article/pii/S0167865512003765, frontiers in Handwriting Processing
- Chen et al [2020a] Chen T, Kornblith S, Norouzi M, et al (2020a) A simple framework for contrastive learning of visual representations. In: III HD, Singh A (eds) Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol 119. PMLR, pp 1597–1607, URL https://proceedings.mlr.press/v119/chen20j.html
- Chen et al [2020b] Chen T, Kornblith S, Swersky K, et al (2020b) Big self-supervised models are strong semi-supervised learners. In: Larochelle H, Ranzato M, Hadsell R, et al (eds) Advances in Neural Information Processing Systems, vol 33. Curran Associates, Inc., pp 22243–22255, URL https://proceedings.neurips.cc/paper_files/paper/2020/file/fcbc95ccdd551da181207c0c1400c655-Paper.pdf
- Chen et al [2021] Chen T, Luo C, Li L (2021) Intriguing properties of contrastive losses. In: Ranzato M, Beygelzimer A, Dauphin Y, et al (eds) Advances in Neural Information Processing Systems, vol 34. Curran Associates, Inc., pp 11834–11845, URL https://proceedings.neurips.cc/paper_files/paper/2021/file/628f16b29939d1b060af49f66ae0f7f8-Paper.pdf
- Everingham et al [2010] Everingham M, Van Gool L, Williams CKI, et al (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2):303–338. 10.1007/s11263-009-0275-4, URL https://doi.org/10.1007/s11263-009-0275-4
- Faigenbaum-Golovin et al [2022] Faigenbaum-Golovin S, Shaus A, Sober B (2022) Computational handwriting analysis of ancient hebrew inscriptions—a survey. IEEE BITS the Information Theory Magazine 2(1):90–101. 10.1109/MBITS.2022.3197559
- Jocher [2020] Jocher G (2020) YOLOv5 by Ultralytics. 10.5281/zenodo.3908559, URL https://github.com/ultralytics/yolov5
- Jocher et al [2023] Jocher G, Chaurasia A, Qiu J (2023) YOLO by Ultralytics. URL https://github.com/ultralytics/ultralytics
- Johnson [2004] Johnson WA (2004) Bookrolls and Scribes in Oxyrhynchus. University of Toronto Press
- Lee et al [2013] Lee DH, et al (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML, Atlanta, p 896
- Lobel et al [1941] Lobel E, Roberts C, Wegener E (eds) (1941) The Oxyrhynchus Papyri: Part XVIII. Egypt Exploration Society
- McCollum [2022] McCollum J (2022) Likelihood calculations for reconstructed lacunae and Papyrus 46’s text of Ephesians 6:19. Digital Scholarship in the Humanities 38(2):647–657. 10.1093/llc/fqac078
- Omayio et al [2022] Omayio EO, Indu S, Panda J (2022) Historical manuscript dating: traditional and current trends. Multimedia Tools and Applications 81(22):31573–31602. 10.1007/s11042-022-12927-8
- Parsons [2007] Parsons P (2007) City of the Sharp-Nosed Fish: Greek Lives in Roman Egypt. Weidenfeld and Nicolson
- Pavao et al [2022] Pavao A, Guyon I, Letournel AC, et al (2022) CodaLab Competitions: An open source platform to organize scientific challenges. Technical report, Université Paris-Saclay, FRA., URL https://inria.hal.science/hal-03629462
- Redmon and Farhadi [2017] Redmon J, Farhadi A (2017) Yolo9000: Better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 6517–6525, 10.1109/CVPR.2017.690
- Redmon and Farhadi [2018] Redmon J, Farhadi A (2018) YOLOv3: An Incremental Improvement. URL http://arxiv.org/abs/1804.02767
- Redmon et al [2016] Redmon J, Divvala S, Girshick R, et al (2016) You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 779–788, 10.1109/CVPR.2016.91
- Seuret et al [2023] Seuret M, Marthot-Santaniello I, White SA, et al (2023) ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri. In: Fink GA, Jain R, Kise K, et al (eds) Document Analysis and Recognition - ICDAR 2023. Springer Nature Switzerland, Cham, pp 498–507, 10.1007/978-3-031-41679-8_29
- Solovyev et al [2021] Solovyev R, Wang W, Gabruseva T (2021) Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing 107:104117. https://doi.org/10.1016/j.imavis.2021.104117, URL https://www.sciencedirect.com/science/article/pii/S0262885621000226
- Swindall et al [2021] Swindall MI, Croisdale G, Hunter CC, et al (2021) Exploring learning approaches for ancient greek character recognition with citizen science data. In: 2021 IEEE 17th International Conference on eScience (eScience), pp 128–137, 10.1109/eScience51609.2021.00023
- Touvron et al [2021] Touvron H, Cord M, Douze M, et al (2021) Training data-efficient image transformers & distillation through attention. In: Meila M, Zhang T (eds) Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol 139. PMLR, pp 10347–10357, URL https://proceedings.mlr.press/v139/touvron21a.html