
VGTS: Visually Guided Text Spotting for Novel Categories in Historical Manuscripts

Wenbo Hu [email protected] Hongjian Zhan [email protected] Xinchen Ma [email protected] Cong Liu [email protected] Bing Yin [email protected] Yue Lu [email protected] Ching Y. Suen [email protected] School of Communication and Electronic Engineering, East China Normal University, Shanghai, 200062, China Shanghai Key Laboratory of Multidimensional Information Processing, Shanghai, 200241, China Centre for Pattern Recognition and Machine Intelligence, Concordia University, Montreal, H3G 1M8, Canada iFLYTEK Research, iFLYTEK, Hefei, 230088, China
Abstract

In the field of historical manuscript research, scholars frequently encounter novel symbols in ancient texts, investing considerable effort in their identification and documentation. Although existing object detection methods achieve impressive performance on known categories, they struggle to recognize novel symbols without retraining. To address this limitation, we propose a Visually Guided Text Spotting (VGTS) approach that accurately spots novel characters using just one annotated support sample. The core of VGTS is a spatial alignment module consisting of a Dual Spatial Attention (DSA) block and a Geometric Matching (GM) block. The DSA block aims to identify, focus on, and learn discriminative spatial regions in the support and query images, mimicking the human visual spotting process. It first refines the support image by analyzing inter-channel relationships to identify critical areas, and then refines the query image by focusing on informative key points. The GM block, on the other hand, establishes the spatial correspondence between the two images, enabling accurate localization of the target character in the query image. To tackle the example imbalance problem in low-resource spotting tasks, we develop a novel torus loss function that enhances the discriminative power of the embedding space for distance metric learning. To further validate our approach, we introduce a new dataset featuring ancient Dongba hieroglyphics (DBH) associated with the Naxi minority of China. Extensive experiments on the DBH dataset and other public datasets, including EGY, VML-HD, TKH, and NC, show that VGTS consistently surpasses state-of-the-art methods. The proposed framework exhibits great potential for application in historical manuscript text spotting, enabling scholars to efficiently identify and document novel symbols with minimal annotation effort.

keywords:
Text Spotting, Low-resource, One-shot Learning, Historical Manuscripts
journal: Expert Systems With Applications

1 Introduction

To preserve cultural heritage, archives create digital libraries by scanning or photographing historical manuscripts [Ma et al., 2020]. Human summarization remains essential in deciphering and categorizing these ancient texts, which often consist of handwritten characters. Researchers employ Optical Character Recognition (OCR) techniques [Yousef and Bishop., 2020] to facilitate manuscript digitization. However, the digitization of historical manuscripts faces significant challenges due to inherent limitations:

(1) Open-set problem. Most methods perform well on trained categories but falter when encountering novel characters, which is a common scenario in historical manuscripts [Wang et al., 2021a]. This challenge is exacerbated as researchers often discover new categories while examining ancient texts, as detailed in the appendix. While unsupervised methods address this challenge, their effectiveness is limited [Wicht et al., 2016, Zagoris et al., 2017, 2021].

(2) Long-tailed distribution. The reliance on extensive annotated data is problematic in this domain, given that historical manuscripts typically feature a limited number of pages and scarce training samples. The long-tailed distribution of characters in ancient manuscripts worsens data sparsity, with some characters appearing only once [Wang et al., 2023]. For instance, in the TKH dataset [Yang et al., 2018], 436 of the 1,492 categories appear only once.

Moreover, traditional methods necessitate text lines as input, requiring time-consuming and error-prone preprocessing, layout analysis, and character segmentation [Bissacco et al., 2013, Xiu et al., 2019]. While recent efforts focus on page-level text spotting [Rothacker et al., 2017], they still face challenges like error propagation and the need for substantial training data [Feng et al., 2019, Qiao et al., 2020]. Additionally, certain historical characters, such as Dongba hieroglyphics or notary signs [Seitzer and Christlein., 2018], are not compatible with most OCR methods [Nguyen et al., 2019, Li et al., 2022, Zhong et al., 2024] because they are absent from standard input method editors, further complicating the textualization of results.

Recently, cognitive research has demonstrated that humans possess a remarkable ability to recognize objects when provided with guidance [Lake et al., 2015], as exemplified by human visual searches in scenes with the aid of literacy cards, among other contextual cues. In this process, humans heavily rely on knowledge of the supporting images and contextual information. This dependence on support images (e.g. literacy cards) is believed to stem from the prefrontal cortex and is projected to lower-level visual cortex structures. Support information acts as a reference, directing attention to specific visual features. Meanwhile, spatial context information narrows the search area, guiding attention towards more relevant locations. The integration of both sources of information results in the identification of the region of interest.

To preserve culture and better manage historical manuscripts, we propose a one-shot learning model inspired by research on human learning and cognition. One-shot learning focuses on learning patterns of a specific category and generalizing them to unseen categories using only a single annotated example, instead of numerous annotated samples. Our model takes an antique book page image and an example image of the desired character class, and spots all instances of that class in the manuscript, even if the support image belongs to an unseen category. To address example imbalance, we introduce a novel torus loss function, which makes the embedding space of the distance metric more discriminative. Our approach handles varying characters and symbols, such as hieroglyphics, Arabic, Chinese, and notary signs in historical manuscripts, offering a flexible and efficient solution for text spotting. Our contributions can be summarized as follows:

(1) We propose a flexible text spotting model that accurately and reliably identifies novel characters with just one annotated example, eliminating the need for additional fine-tuning or retraining.

(2) We propose an innovative dual spatial attention block, which extracts discriminative features from the support set to direct the model’s focus to crucial regions in the query image. We also introduce a novel ‘torus loss’, designed to concentrate on challenging examples and to develop a more effective distance metric for text spotting in low-resource settings.

(3) We contribute a new dataset of ancient Dongba script manuscripts, comprised of hieroglyphic characters.

2 Related work

2.1 Text Spotting

Certain segmentation-based methods [Kesidis et al., 2011, Khurshid et al., 2012, Zagoris et al., 2014] assume that the dataset is already partitioned into text lines or words, thereby rendering these methods more akin to an image retrieval task, where the goal is to retrieve the most relevant images in the dataset. Additionally, some text spotting techniques typically detect each text instance using a trained detector, then identify the cropped text region using a sequence decoder [Zhou et al., 2017, Shi et al., 2016]. However, such a strategy undermines robustness because the detection and recognition phases are separated, and the error from the recognition model cannot be used to optimize the detector. In recent times, there has been growing interest in end-to-end scene text detection and recognition, which offers the benefit of achieving both detection and recognition by sharing the acquired features between the two steps [Feng et al., 2019, Qiao et al., 2020, Wang et al., 2020, Liu et al., 2020a, Wang et al., 2021b]. Nevertheless, these methods usually require an extensive corpus of data for adequate training and are unable to identify new categories. Another solution, proposed by Wicht et al. [2016], uses a sliding-window scheme with deep features to accomplish the text spotting task. This approach can be seen as a form of unsupervised learning and therefore avoids the open-set problem; unfortunately, such sliding-window approaches usually do not achieve the best results. Wilkinson et al. [2017] propose a word spotting method based on the “query-by-string” approach, which allows the user to retrieve all instances of a given string in a document image. However, while this method can be effective for seen characters, it does not address the challenge of mining unseen characters, which is particularly relevant for ancient texts.

Furthermore, many ancient texts do not yet have a complete lexicon, and consequently, there are no corresponding input method editors, making it challenging to identify and retrieve specific strings in these documents. Historical manuscripts often pose a significant challenge due to the limited availability of annotated images. For instance, the Pascal VOC dataset [Everingham et al., 2010], widely utilized in few-shot object detection, includes only 1.6 categories and 2.9 instances per image on average. In stark contrast, the data distribution of historical manuscript images is significantly broader, as evidenced by the TKH dataset, which has an average of 72.5 categories and 323.8 character instances per image. Drawing inspiration from few-shot object detection and to address the challenge of limited data in historical manuscripts, Souibgui et al. [2022] design a few-shot character recognition method that aims to overcome the low-resource problem and enable the model to recognize unseen alphabets. Nonetheless, the methods of Souibgui et al. [2022, 2021] are only applicable to text-line images, necessitating prior layout analysis to segment the original page into lines. In contrast, our proposed method can process page images without requiring additional layout analysis.

Fig. 1: The overall framework of VGTS. The proposed framework can spot characters from novel categories of interest conditioned on a single support image. Based on the features extracted by the backbone network, the correlation matching module computes the correlation map to match each pair of individual feature maps, while the spatial alignment module predicts the localization boxes and spotting results.

2.2 Few-shot Object Detection

In the realm of few-shot learning, existing techniques have predominantly been designed for classification tasks. Optimization-based methods assimilate novel categories via gradient-based optimization on a small number of annotated samples [Rusu et al., 2018, Sun et al., 2019, Jia et al., 2024]. Conversely, the metric-based learning paradigm seeks to learn a common feature space and separate categories based on a distance metric [Oreshkin et al., 2018, Zhang et al., 2023]. One-shot learning [Chen et al., 2019, Tsutsui et al., 2022] is an extreme form of the few-shot learning paradigm, in which each newly encountered category has only a single labeled example.

The goal of few-shot object detection is to discern the location and identity of objects in query images with the aid of a limited set of labeled support images [Hsieh et al., 2019, Osokin et al., 2020, Fan et al., 2020, 2021, Cheng et al., 2021, Huang et al., 2022, He et al., 2023, Lim et al., 2024]. To circumvent the open-set problem, a two-stage approach is often employed, in which the model is first trained on base categories and then fine-tuned on a small number of samples for novel categories. However, this approach can lead to confusion between novel and base categories, resulting in decreased model efficacy as the number of novel categories increases [Fan et al., 2020, 2021]. In situations where training samples are scarce, some approaches struggle to locate regions of interest for novel objects, particularly those with unlearnable shape priors and fine-tuned RPNs [Ren et al., 2017]. However, certain methods have demonstrated the ability to detect new categories without fine-tuning, such as OS2D [Osokin et al., 2020], which uses a similarity map and learned dense matches to establish correspondence between support and query images. This model excels at learning pixel-matching relations between the support and query images, thereby allowing it to handle novel classes without fine-tuning. Our proposed model is also based on dense correlation matching features. However, in few-shot tasks with a small number of samples, attention must be explicitly directed toward the target item (based on the support image), since most visual prediction models exhibit free-viewing behavior. Compared with OS2D, our model design strategy is closer to the human cognitive process, despite the different objectives of our task. Additionally, we discuss the spotting process in low-resource scenarios and propose a new loss function for it. CoAE [Hsieh et al., 2019] also requires no fine-tuning; it proposes a non-local RPN to tackle the problem of one-shot object detection. However, RPN-based methods often use a predefined set of anchor boxes with different scales and aspect ratios to generate region proposals, which may not be able to capture the minute variations in size and aspect ratio of small objects, especially in historical manuscripts.

3 Proposed method

By providing only a single sample of each character type in the support image gallery, the model can effectively locate the corresponding characters within the manuscript. As illustrated in Fig. 1, the proposed model comprises two modules: the correlation matching module and the spatial alignment module. In the correlation matching module, a feature extractor first extracts the features of both the query and support images. The relationship between the pair of feature maps is then computed to obtain the correlation map. Subsequently, the spatial alignment module learns the transformation parameters that enable the model to localize the support image within the query image. Since certain characters in ancient manuscripts cannot yet be entered using an input method editor, we select items from the support image gallery S_{gallery}\in C_{base}\cup C_{novel} and paste them onto a blank image of the same size as the query image I_{q}\in T_{test}. This creates a new reference image to be displayed as the result of text spotting, where C_{base} denotes the base categories that appear during the training process, and C_{novel} denotes the novel categories that do not appear during training.
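To make the overall data flow concrete, the following is a minimal sketch of the inference loop implied by Fig. 1. Here `backbone`, `correlation_map`, and `spatial_alignment` are hypothetical callables standing in for the components described in Sections 3.1 and 3.2, not the released implementation; multi-scale pyramids and score thresholding are omitted for brevity.

```python
def spot_page(query_img, support_gallery, backbone, correlation_map, spatial_alignment):
    """query_img: 1 x 3 x H x W tensor; support_gallery: dict {category: 1 x 3 x h x w tensor}."""
    f_q = backbone(query_img)                        # query feature map (computed once per page)
    results = {}
    for category, support_img in support_gallery.items():
        f_s = backbone(support_img)                  # shared-weight support branch
        corr = correlation_map(f_q, f_s)             # 4D correlation tensor (Section 3.1.2)
        boxes, scores = spatial_alignment(corr)      # DSA + GM blocks (Section 3.2)
        results[category] = (boxes, scores)
    return results
```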

3.1 Correlation Matching

3.1.1 Feature Extractor

The input support and query images (I_{s},I_{q}) are passed through two feature-extraction CNN branches with shared weights. Since most collections of antique manuscripts are too small to train the network from scratch, the feature extractor can only gain experience from the base categories and receives no targeted training on the novel categories. For each input support image I_{s}^{i}, the feature map f_{\hat{s}}^{i}\in R^{h_{\hat{s}}^{i}\times w_{\hat{s}}^{i}\times D} obtained from the backbone is uniformly resized to the same size, f_{\hat{s}}\rightarrow f_{s}\in R^{h_{s}\times w_{s}\times D}. Finally, the feature extraction networks produce the support feature F(I_{s})=f_{s}\in R^{h_{s}\times w_{s}\times D} and the query feature F(I_{q})=f_{q}\in R^{h_{q}\times w_{q}\times D}, respectively, where F represents the backbone, h\times w is the spatial resolution, and D is the number of channels.
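A minimal sketch of such a shared-weight extractor is shown below, assuming that the "top three blocks" of ResNet-50 are its first three residual stages (giving D = 1024 channels) and that the support feature is resized to a fixed 15x15 grid; both choices are illustrative assumptions rather than the authors' exact configuration, and in practice ImageNet-pretrained weights would be loaded.

```python
import torch
import torch.nn.functional as F
import torchvision

resnet = torchvision.models.resnet50()              # pretrained weights would be loaded in practice
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,    # first three residual stages -> D = 1024
)

def extract_features(img_q, img_s, support_size=(15, 15)):
    """img_q, img_s: 1 x 3 x H x W tensors; both branches share the same backbone weights."""
    f_q = backbone(img_q)                            # 1 x D x h_q x w_q
    f_s = backbone(img_s)                            # 1 x D x h_s^i x w_s^i (varies per support image)
    f_s = F.interpolate(f_s, size=support_size,      # uniformly resize to the common h_s x w_s
                        mode="bilinear", align_corners=False)
    return f_q, f_s
```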

3.1.2 Correlation Map

Intuitively, to generalize well to unseen categories, the model must make full use of the reference information provided by the support image. A common strategy for computing the correlation map is to match a pair of individual feature maps using normalized cosine similarity [Rocco et al., 2017, Liu et al., 2020b]. Given a pair of individual features f_{s}\in R^{h_{s}\times w_{s}\times D} and f_{q}\in R^{h_{q}\times w_{q}\times D} of the support and query images, the correlation map is computed as:

C_{abkl}=\phi(f_{q}^{ab},f_{s}^{kl})=\frac{f_{q}^{ab}\cdot f_{s}^{kl}}{\|f_{q}^{ab}\|\,\|f_{s}^{kl}\|},\quad C\in R^{h_{q}\times w_{q}\times h_{s}\times w_{s}}, (1)

where \phi(\cdot) denotes cosine similarity, f_{q}^{ab} and f_{s}^{kl} are the local descriptors at the (a,b)-th position of the query feature map and the (k,l)-th position of the support feature map, and C_{abkl} is the matching score between these two positions. The result is thus a 4D tensor that captures the similarities between all pairs of spatial locations.
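A minimal PyTorch sketch of Eq. (1) is given below: each spatial descriptor is L2-normalized so that the dot products are cosine similarities, producing the 4D tensor C. The reshape into the 3D tensor C_{r} used by the spatial alignment module (Section 3.2) is then simply corr.view(h_q, w_q, h_s * w_s). Shapes follow the channel-first convention of the extractor sketch above.

```python
import torch
import torch.nn.functional as F

def correlation_map(f_q, f_s, eps=1e-8):
    """f_q: 1 x D x h_q x w_q, f_s: 1 x D x h_s x w_s -> C: h_q x w_q x h_s x w_s."""
    _, D, h_q, w_q = f_q.shape
    _, _, h_s, w_s = f_s.shape
    q = F.normalize(f_q.view(D, h_q * w_q), dim=0, eps=eps)   # L2-normalize each local descriptor
    s = F.normalize(f_s.view(D, h_s * w_s), dim=0, eps=eps)
    corr = q.t() @ s                                          # (h_q*w_q) x (h_s*w_s) cosine scores
    return corr.view(h_q, w_q, h_s, w_s)                      # C_{abkl}
```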

Fig. 2: Correlation map computation using pairs of individual features involves first extracting image descriptors f_{s} and f_{q} from images I_{s} and I_{q}, respectively. Subsequently, all pairs of individual feature matches, f_{s}^{kl} and f_{q}^{ab}, are represented in the 4D space of matches (a,b,k,l), where the matching score is stored in a 4D correlation tensor. This 4D tensor is then reshaped into a 3D tensor with dimensions h_{q}, w_{q}, and (h_{s}\times w_{s}), allowing for the construction of a correlation map. At a specific spatial location (a,b), the correlation map C_{r} provides an aggregation of all similarities between f_{q}(a,b) and all f_{s}.

3.2 Spatial Alignment

The spatial alignment module consists of two blocks: the dual spatial attention (DSA) block and the geometric matching (GM) block. The primary purpose of the DSA block is to find, focus on, and learn discriminative spatial regions in the support and query images, while the primary purpose of the GM block is to find the spatial mapping between the support image and the query image. The two blocks reinforce each other, which helps to obtain more accurate text spotting results.

The geometric matching methods [Zhao et al., 2021, Kim et al., 2022a, b] previously utilized for correspondence learning are not directly applicable here: the number of characters appearing in antique manuscripts is very large, the character shapes are usually tiny, and the structures of different characters are very similar. For this reason, we develop a spatial alignment module that locates, at the dense pixel level, the position on the query image of the character instance given by the support image. Specifically, the spatial alignment module aims to find the spatial correspondence between a pair of support and query images. To achieve this, the obtained 4-dimensional correlation map is reshaped into a 3-dimensional tensor with dimensions h_{q}, w_{q} and (h_{s}\times w_{s}), i.e., C_{r}\in R^{h_{q}\times w_{q}\times(h_{s}\times w_{s})}. The reshaped correlation map C_{r} can be considered as a dense h_{q}\times w_{q} grid with (h_{s}\times w_{s})-dimensional local features, as shown in Fig. 2.

Fig. 3: Flowchart of the dual spatial attention block. Given the correlation map C_{r} from the correlation matching module, the first step is to attend to the dimension d=(h_{s}\times w_{s}) of the correlation map, which can be regarded as a refinement for the support image. The next step is to attend to the dimensions h_{q} and w_{q} of the correlation map, which can be regarded as a refinement for the query image.

3.2.1 Dual Spatial Attention Block

When humans are guided by support images, even a glance is enough to remember the most critical salient regions that contribute to visual spotting. Besides, when we need to find a character on a page of a book, blank areas and illustrations are ignored during visual spotting. Thus, the search area is narrowed to concentrate on potential regions of interest within the query image; in other words, humans focus on potentially discriminative areas.

Step 1. Refinement for the support image. The correlation matching module provides the correlation map C_{r}\in R^{h_{q}\times w_{q}\times d}, where d=(h_{s}\times w_{s}). The correlation map not only contains all information from the support and query images, but also captures the similarity between the two images. Each channel d_{j} of C_{r} represents the correlation of the j-th pixel of the support feature map, \{f_{s}\}_{j}, with f_{q}, so we produce a 1D spatial-wise attention map that indicates “where” the meaningful locations in the support image are, as shown in Fig. 3. We first feed the obtained C_{r} into the dual spatial attention block to find regions with key discriminative properties in the support image. The spatial information of the correlation map C_{r} is aggregated using average-pooling and max-pooling operations, generating two different spatial context descriptors, f^{avg}_{c} and f^{max}_{c}. The features pooled by each pooling layer are then passed through the support refinement network, which consists of two convolution layers with kernel size 1. The support refinement network analyzes the correlation map C_{r} to identify the most important regions in the support image by learning the relationships between different channels. Element-wise summation is then used to merge the two results:

M_{s}=\sigma_{s}(W_{1}(W_{0}(AvgPool(C_{r})))+W_{1}(W_{0}(MaxPool(C_{r})))), (2)

where W_{0}\in R^{d\times(d/\tau)}, W_{1}\in R^{(d/\tau)\times d} are the parameters of the convolutions, \sigma_{s} represents the sigmoid activation function, and \tau is the reduction ratio. We then multiply M_{s}\in R^{1\times 1\times d} with the correlation map C_{r}:

C_{r}^{s}=M_{s}\otimes C_{r}, (3)

where C_{r}^{s} captures the key points in the spatial dimension of the support image and \otimes denotes element-wise multiplication.

Step 2. Refinement for the query image. Note that a particular position (a,b) of C_{r}^{s} contains the similarities between f_{q}(a,b) and all the features of f_{s}. As illustrated in Fig. 3, we also generate a 2D spatial attention map to focus on the positionally informative key points of the query image. Average-pooling and max-pooling operations are applied along the channel dimension and the pooled features are concatenated. A convolution layer is then applied to generate the spatial attention map:

M_{q}=\sigma_{q}(Conv([AvgPool(C_{r}^{s});MaxPool(C_{r}^{s})])), (4)

where \sigma_{q} represents the sigmoid activation function, and Conv represents a convolution operation with the filter size of 3\times 3. To obtain the final refined features, we multiply M_{q}\in R^{h_{q}\times w_{q}\times 1} with C_{r}^{s}, which can be expressed briefly as:

C_{r}^{sq}=M_{q}\otimes C_{r}^{s}, (5)

where C_{r}^{sq} captures the key points in the spatial dimension of the query image and retains the key-point focus of the support image.
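A minimal PyTorch sketch of the DSA block is given below, with the correlation map laid out as N x d x h_q x w_q (d = h_s x w_s). The reduction ratio tau = 16 and the ReLU between W_0 and W_1 follow common channel-attention practice and are assumptions, not values stated above.

```python
import torch
import torch.nn as nn

class DualSpatialAttention(nn.Module):
    def __init__(self, d, tau=16):
        super().__init__()
        # Step 1: support refinement, attention over the d = h_s*w_s channels (Eqs. (2)-(3))
        self.support_mlp = nn.Sequential(
            nn.Conv2d(d, d // tau, kernel_size=1, bias=False),   # W_0
            nn.ReLU(inplace=True),                               # assumed, as in CBAM-style blocks
            nn.Conv2d(d // tau, d, kernel_size=1, bias=False),   # W_1
        )
        # Step 2: query refinement, 2D attention over h_q x w_q (Eqs. (4)-(5))
        self.query_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, c_r):                                      # c_r: N x d x h_q x w_q
        avg = self.support_mlp(c_r.mean(dim=(2, 3), keepdim=True))
        mx = self.support_mlp(c_r.amax(dim=(2, 3), keepdim=True))
        m_s = torch.sigmoid(avg + mx)                            # M_s: N x d x 1 x 1
        c_rs = m_s * c_r                                         # C_r^s
        pooled = torch.cat([c_rs.mean(dim=1, keepdim=True),
                            c_rs.amax(dim=1, keepdim=True)], dim=1)  # N x 2 x h_q x w_q
        m_q = torch.sigmoid(self.query_conv(pooled))             # M_q: N x 1 x h_q x w_q
        return m_q * c_rs                                        # C_r^{sq}
```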

3.2.2 Geometric Matching Block

For the obtained attention correlation map C_{r}^{sq}, the geometric matching block is applied to find the spatial correspondence between I_{s} and I_{q}. To achieve this, we optimize a transformation function W_{\theta}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}, where \theta denotes the alignment parameters. The spatial correspondence between I_{q} and I_{s} is given by (k',l')=W_{\theta}(k,l), where (k',l') are the spatial coordinates in f_{q} corresponding to (k,l). Similar to [Rocco et al., 2018], the geometric matching block consists of two consecutive convolutional layers without padding and with stride 1, with batch normalization and ReLU between them, followed by a fully-connected layer that regresses the alignment parameters \theta. The goal of the GM block is to find the best parameters \hat{\theta}:

\hat{\theta}=\underset{\theta}{\arg\max}\sum_{(k,l)}\phi(f^{q}_{W_{\theta}(k,l)},f^{s}_{k,l}), (6)

where \sum_{(k,l)}\phi(f^{q}_{W_{\theta}(k,l)},f^{s}_{k,l}) is the sum of feature similarities between the support feature and the corresponding regions on the query image obtained through the 2D geometric transformation. The obtained transformation is then input to the grid sampler to generate a grid of points that aligns the support image with each location of the query image.

Note that localization is already performed implicitly during alignment: a tight square bounding box is output based on the maximum and minimum values of the transformed grid points. The spotting score is obtained by computing \sum_{(k,l)}\phi(f^{q}_{W_{\theta}(k,l)},f^{s}_{k,l}), i.e., the similarity between the local region of the query feature map, f^{q}_{W_{\theta}(k,l)}, and the input support image feature, f^{s}_{k,l}.
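The following is a hedged sketch of the GM block for a single candidate window: the attended correlation map is pooled to a fixed size, two convolutions with BatchNorm and ReLU and a fully connected layer regress the six affine parameters theta, F.affine_grid produces the sampling grid, and a tight box follows from the grid extremes. The channel widths, kernel sizes, and fixed 15x15 pooling are illustrative assumptions; in the full model the regression is applied densely over the query image rather than to one window.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricMatching(nn.Module):
    def __init__(self, d, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv2d(d, hidden, kernel_size=7)     # no padding, stride 1
        self.bn = nn.BatchNorm2d(hidden)
        self.conv2 = nn.Conv2d(hidden, 64, kernel_size=5)
        self.fc = nn.Linear(64 * 5 * 5, 6)                   # regress the affine parameters theta

    def forward(self, c_rsq, grid_size=(15, 15)):            # c_rsq: N x d x h_q x w_q
        x = F.adaptive_avg_pool2d(c_rsq, (15, 15))           # fix the spatial size for the FC layer
        x = F.relu(self.bn(self.conv1(x)))                   # 15 -> 9
        x = self.conv2(x)                                    # 9 -> 5
        theta = self.fc(x.flatten(1)).view(-1, 2, 3)         # N x 2 x 3 affine transform
        grid = F.affine_grid(theta, [theta.size(0), 1, *grid_size], align_corners=False)
        xs, ys = grid[..., 0], grid[..., 1]                  # grid points in normalized [-1, 1] coords
        boxes = torch.stack([xs.amin(dim=(1, 2)), ys.amin(dim=(1, 2)),
                             xs.amax(dim=(1, 2)), ys.amax(dim=(1, 2))], dim=1)  # tight boxes
        return theta, grid, boxes
```

The spotting score is then obtained by sampling the query features at `grid` (e.g., with F.grid_sample) and summing their cosine similarities with the support features, as in Eq. (6).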

3.3 Loss Function

The loss function needs to construct the training objective from both localization and spotting [Radenović et al., 2018, Osokin et al., 2020, Fan et al., 2020].

Localization Loss. In this study, the smooth L_{1} loss is used for localization:

l_{loc}(v,t)=\begin{cases}\sum_{p}\frac{1}{2}(v_{p}-t_{p})^{2},&\text{if }|v_{p}-t_{p}|<1,\\ \sum_{p}\left(|v_{p}-t_{p}|-\frac{1}{2}\right),&\text{otherwise,}\end{cases} (7)

where v=(v_{x},v_{y},v_{w},v_{h}) denotes the ground-truth box coordinates and t=(t_{x},t_{y},t_{w},t_{h}) denotes the predicted box coordinates.
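Eq. (7) coincides with PyTorch's built-in smooth L_{1} loss with beta = 1 and a sum over the four box coordinates, so a minimal implementation (a sketch, not the authors' code) can simply delegate to it:

```python
import torch
import torch.nn.functional as F

def localization_loss(v, t):
    """v, t: N x 4 tensors of ground-truth and predicted box coordinates."""
    return F.smooth_l1_loss(t, v, beta=1.0, reduction="sum")
```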

Spotting Loss. Our task has an inherent difficulty in balancing positive and negative examples. Ideally, the spotting score of a positive example should be as high as possible, while that of a negative example should be as low as possible. To separate positive and negative examples, the score of a positive example should be greater than the margin m_{pos}, while the score of a negative example should be less than the margin m_{neg}. Therefore, the hinge-embedding loss with margins can be written as follows:

l_{pos}^{i}=\max(m_{pos}-s_{i},0), (8)
l_{neg}^{i}=\max(s_{i}-m_{neg},0), (9)

where m_{pos} and m_{neg} are the positive and negative margins, and s_{i}\in\left[-1,1\right] is the spotting score. To separate positive and negative examples, the contrastive loss [Chopra et al., 2005] can be considered:

L_{c}=\sum_{i}(l_{pos}^{i}+l_{neg}^{i}). (10)

To acquire discriminative embeddings through contrastive learning, Wang et al. [2019] introduced the ranked list (RL) loss which can adaptively mine negative samples by providing a weighting parameter. The RL loss can be formulated as follows:

L_{rl}=\sum_{i}(l_{pos}^{i}+w_{i}\,l_{neg}^{i}), (11)

where w_{i} is defined as w_{i}=\exp(T\cdot(s_{i}-m_{neg})) for s_{i}-m_{neg}>0, and T is the temperature parameter which controls the degree of weighting of negative examples. However, this loss function cannot handle the challenging examples.

The RL loss is designed to assign different weights to negative examples and measure the distance between s_{i} and m_{neg}, thereby alleviating the issue of imbalanced examples. Fig. 4(a) depicts the data distribution before training. During training (Fig. 4(b)), examples C and D are expected to incur a large loss from the RL loss (Eq. (11)). However, the challenging examples A and B within the margin gap will receive a relatively small loss, which is suboptimal for achieving the desired separation between positive and negative examples. To address this issue, it is desirable for the values of positive examples to exceed the margin threshold m_{pos}, whereas the values of negative examples should fall below the margin m_{neg}. Accordingly, we aim to focus on the challenging examples in the margin gap and give them more attention to achieve a clear separation between positive and negative examples. For an example in the margin gap, we set:

g_{i}=\left\{\begin{array}{rl}m_{pos},&s_{i}\geq m_{pos}\\ s_{i},&m_{neg}<s_{i}<m_{pos}\\ m_{neg},&s_{i}\leq m_{neg}\end{array}\right. (12)

Then the improved hinge-embedding loss with margins can be formulated as follows:

l_{pos^{\prime}}^{i}=l_{pos}^{i}-\log\left(\frac{g_{i}}{m_{pos}}\right), (13)
l_{neg^{\prime}}^{i}=w_{i}\,l_{neg}^{i}-\log\left(\frac{m_{pos}+m_{neg}-g_{i}}{m_{pos}}\right), (14)

where w_{i} is again defined as w_{i}=\exp(T\cdot(s_{i}-m_{neg})) for s_{i}-m_{neg}>0. When s_{i} falls in the margin gap, both positive and negative examples receive an extra loss, which helps to further separate them. Through the term -\log(g_{i}/m_{pos}), we push the value of positive examples to exceed the margin threshold m_{pos}; if the spotting score is already greater than m_{pos}, it meets the optimization goal and no further optimization is necessary. Similarly, through the term -\log((m_{pos}+m_{neg}-g_{i})/m_{pos}), we push the value of negative examples to fall below the margin threshold m_{neg}.

Fig. 4: Data distribution analysis. (a) Prior to training, the data may be distributed in a complex and nonlinear fashion, with no clear separation between positive and negative examples. (b) During training, the model may be able to identify some patterns and achieve a degree of separation between positive and negative examples. Some examples may still remain in the margin gap, where the decision boundary is unclear and further optimization is needed to achieve greater separation. (c) To achieve optimal performance, we expect the values of positive examples to be greater than the margin m_{pos}, while the values of negative examples should be less than the margin m_{neg}.

The new loss function, which improves the RL loss and which we call the torus loss, can be formulated as follows:

L_{torus}=\sum_{i}(l_{pos^{\prime}}^{i}+l_{neg^{\prime}}^{i}). (15)

The torus loss not only addresses the issue of challenging examples in the margin gap, but also helps to mitigate the impact of example imbalance. Therefore, we can use this function as the spotting loss.
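A minimal sketch of the torus loss (Eqs. (8)-(15)) is shown below, assuming `scores` holds the spotting scores s_i in [-1, 1] and `labels` marks positive (1) versus negative (0) examples. The margins follow Section 4.2, while the temperature T = 10 is an illustrative assumption since its value is not stated here.

```python
import torch

def torus_loss(scores, labels, m_pos=0.6, m_neg=0.5, T=10.0):
    pos, neg = labels.bool(), ~labels.bool()
    g = scores.clamp(min=m_neg, max=m_pos)                        # Eq. (12)
    l_pos = (m_pos - scores).clamp(min=0)                         # Eq. (8)
    l_neg = (scores - m_neg).clamp(min=0)                         # Eq. (9)
    w = torch.exp(T * (scores - m_neg).clamp(min=0))              # hard-negative weights (w = 1 where l_neg = 0)
    l_pos_t = l_pos - torch.log(g / m_pos)                        # Eq. (13)
    l_neg_t = w * l_neg - torch.log((m_pos + m_neg - g) / m_pos)  # Eq. (14)
    return l_pos_t[pos].sum() + l_neg_t[neg].sum()                # Eq. (15)
```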

Training Objective. Finally, the total loss of our model can be expressed as:

L=\lambda L_{loc}+L_{torus}, (16)

where L_{loc}=\sum_{i}l_{loc}(v_{i},t_{i}) and L_{torus}=\sum_{i}(l_{pos^{\prime}}^{i}+l_{neg^{\prime}}^{i}). \lambda is a hyperparameter that balances the importance of the two loss terms.

4 Experiments

4.1 Datasets

To thoroughly evaluate the efficacy of our proposed method and contribute meaningfully to the research community, we have meticulously curated existing public datasets and created a new dataset.

Dongba Hieroglyphics dataset (DBH) is a new dataset featuring Dongba characters, an ancient script created by the ancestors of the Naxi minority in China. These pictographs hold historical and literary value and have been recognized as “Memory of the World” by UNESCO. As the total number of distinct Dongba characters is unknown, text spotting in these manuscripts is an open-set problem, where spotting novel characters helps decipher historical texts. We collected data from Dongba sutras, annotating and summarizing them into a dataset of 3,633 bounding boxes across 253 categories. To accommodate one-shot tasks, experts hand-wrote an additional 253 characters, which are used as support images. The dataset is available for download at https://github.com/infinite-hwb/VGTS/tree/master/DATA.

Egyptian Hieroglyph Dataset (EGY) [Franken and van Gemert., 2013] is built from hieroglyphs in 10 different pictures from the book “The Pyramid of Unas” and comprises 171 classes. For this dataset, 4 images are selected for training, while 6 images are used for testing. It notably includes 68 novel categories that are not present during the training process.

VML-HD dataset (VML) [Kassis et al., 2017] is a valuable resource for researchers working with historical Arabic documents. It comprises 94 images, with a total of 1,770 unique character categories, of which 150 are randomly selected as novel categories. The training set includes 65 images, and the remaining 29 images are used for testing.

Tripitaka Koreana in Han dataset (TKH) [Yang et al., 2018] consists of scanned images from Chinese antiquarian books. It contains a total of 999 images, with 1,492 different character categories. For this dataset, 171 images are chosen for training, and 828 images are reserved for testing.

Notary Charters dataset (NC) [Seitzer and Christlein., 2018] is a collection of manuscripts from classical and medieval eras, consisting of 388 different categories of notary charters. Among these, 311 categories appear only once, and are thus selected for training. The remaining images are reserved for testing, with 77 categories chosen as novel categories.

4.2 Implementation details

We employ a data augmentation strategy that randomly crops patches of size 800\times 800 from all query images. This approach not only prevents over-fitting, but also allows the model to learn from input query images containing multiple characters and categories. Besides, we iteratively adjusted the hyperparameters based on the results obtained from the validation set and selected the model that achieved the best performance. For the training objective, we set m_{pos}=0.6 and m_{neg}=0.5, and the hyperparameter \lambda is set to 0.2. Our method is implemented in PyTorch 1.9 and trained using the Adam optimizer [Kingma and Ba., 2014] with a learning rate of 1e-4 and a beta1 value of 0.9 on one Nvidia GeForce RTX 3090. The top three blocks of the ResNet-50 network [He et al., 2016] pre-trained on ImageNet [Russakovsky et al., 2015] are used as the feature extractor. We apply an affine transformation with a 6-degree-of-freedom linear transformation for the geometric matching. To evaluate the performance of our model, we use the standard Pascal VOC metric [Everingham et al., 2010], following the mainstream setting of using mean average precision (mAP), recall and F1 score as the evaluation metrics. We use an intersection-over-union (IoU) threshold of 0.5 for mAP evaluation. All datasets and code will be publicly available at https://github.com/infinite-hwb/VGTS. To mitigate memory strain, the TKH dataset's test set is divided into Set 1 and Set 2, comprising 390 and 438 test images, respectively. During evaluation, the support image is resized while maintaining its aspect ratio for all comparative methods. A five-level pyramid (0.4, 0.6, 0.8, 1.0, and 1.2 times the dataset scale) is employed for the query image. The DBH dataset has a scale of 2500, the VML dataset has a scale of 4000, the TKH dataset has a scale of 5500, and the NC dataset has a scale of 4000.
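The optimization setup above can be summarized by the hedged sketch below; `model`, `loc_loss_fn`, and `spot_loss_fn` are placeholders for the components sketched in Section 3 (and the model's output signature is assumed), not the released training script.

```python
import torch

def training_step(model, optimizer, query_img, support_img, gt_boxes, labels,
                  loc_loss_fn, spot_loss_fn, lam=0.2):
    """One optimization step of Eq. (16): L = lambda * L_loc + L_torus."""
    pred_boxes, scores = model(query_img, support_img)
    loss = lam * loc_loss_fn(gt_boxes, pred_boxes) + spot_loss_fn(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```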

4.3 Results

4.3.1 Comparison Baseline

Initially, it is hypothesized that certain one-shot object detection methods could also address the task of one-shot text spotting. To test this hypothesis, we retrained some popular few-shot object detection methods [Fan et al., 2020] using the four datasets previously summarized. Unfortunately, the experimental performance fell short, struggling to surpass 8% mAP. This is mainly due to the fact that historical manuscripts usually comprise a large number of characters, frequently exceeding 100 different categories. These characters possess similar glyphs or vary significantly in scale, thereby presenting significant challenges for one-shot text spotting. Although we attempted to use the popular object detector Faster RCNN [Ren et al., 2017], the paucity of training data made it difficult to achieve meaningful results. Therefore, we opted to use OS2D [Osokin et al., 2020], a dense correlation matching feature-based method for one-shot object detection, as our baseline. This approach trains the transform network from a model pre-trained with weak supervision. As there has been no prior research on one-shot text spotting, we adapted OS2D [Osokin et al., 2020] to suit our historical manuscripts by fine-tuning the method's parameters and using the same data augmentation scheme as our proposed method. Although CoAE's intended purpose is to detect objects in natural scenes, it has been included as one of the comparison methods for its efficacy in one-shot object detection using the RPN approach in recent years. To make a fair comparison, we have also added multi-scale pyramids to the training and testing of CoAE. Furthermore, we have experimented with DSW [Wicht et al., 2016], a classic unsupervised approach for text spotting based on the “sliding window” technique, and reproduced its main idea by directly applying the feature map of the support image as a convolution kernel on the query image, after extracting features with a Siamese network identical in specification to ours.

Table 1:
Quantitative evaluation on DBH, EGY, VML, TKH and NC on the “Novel” class. The best results are highlighted in bold.
Dataset Method Trained mAP Recall F1
DBH DSW No 51.70 59.64 55.39
CoAE Yes 62.64 60.36 61.48
OS2D Yes 98.08 99.29 98.68
VGTS(Ours) Yes 99.85 100.0 99.92
EGY DSW No 51.88 72.86 60.61
CoAE Yes 12.59 55.06 20.50
OS2D Yes 78.63 91.43 84.55
VGTS(Ours) Yes 83.86 97.62 90.22
VML DSW No 62.77 74.15 67.99
CoAE Yes 39.07 60.17 47.38
OS2D Yes 94.39 98.31 96.31
VGTS(Ours) Yes 100.0 100.0 100.0
TKH Set 1 DSW No 42.63 92.02 58.27
CoAE Yes 36.74 47.30 41.36
OS2D Yes 88.05 99.60 93.47
VGTS(Ours) Yes 90.20 99.65 94.69
TKH Set 2 DSW No 37.95 88.68 53.15
CoAE Yes 35.78 45.73 40.15
OS2D Yes 86.20 98.35 80.55
VGTS(Ours) Yes 88.06 99.52 93.44
NC DSW No 59.60 66.06 62.66
CoAE Yes 85.82 93.58 89.53
OS2D Yes 76.15 77.92 77.02
VGTS(Ours) Yes 92.14 96.79 94.41

4.3.2 Statistical Results and Analysis

The statistical results for the Novel class across various datasets are presented in Table 1. The Novel class denotes an unseen category absent during training. In contrast, the statistical outcomes for the Base class across different datasets are displayed in Table 2, with the Base class signifying a seen category present during training.

Experimental results reveal that our method attains the highest recall score across all datasets, suggesting a low propensity for missed text spotting occurrences. Additionally, our proposed method surpasses the other methods in mAP and F1 scores. It is pertinent to acknowledge that the sliding window method may be deemed an unsupervised technique, and the accurate aspect ratio of support image feature maps is provided for this approach. Nevertheless, the performance of the unsupervised sliding window method is inferior to supervised learning methods, such as OS2D and VGTS. Moreover, the unsupervised DSW method occasionally outperforms the supervised CoAE method, likely due to inadequate data for comprehensive CoAE learning. Regarding data distribution, the DBH, EGY and NC datasets exhibit a more dispersed inter-class character distribution in manuscript images, whereas the VML and TKH datasets display a more compact inter-class distribution and smaller character size, which CoAE struggles to accommodate. With the EGY, VML and TKH datasets containing numerous similar characters, our VGTS method outperforms the baseline method. We also observed that performance on novel categories occasionally exceeds that of base categories, a phenomenon evident across multiple methods. This observation may be attributed to the relatively smaller data volume for novel categories, which can significantly impact their precision and recall rates during evaluation. Specifically, when a method accurately spots limited samples of novel categories, precision may increase, resulting in higher Average Precision (AP) for those categories.

Table 2:
Quantitative evaluation on DBH, EGY, VML and TKH on the “Base” class. Note that the test set of the NC does not contain the base categories, and hence, we only evaluate the other four datasets. The best results are highlighted in bold.
Dataset Method Trained mAP Recall F1
DBH DSW No 39.08 35.76 37.35
CoAE Yes 72.55 93.55 81.72
OS2D Yes 89.22 94.76 91.91
VGTS(Ours) Yes 91.74 95.17 93.42
EGY DSW No 18.44 39.36 25.11
CoAE Yes 13.43 50.48 21.22
OS2D Yes 52.92 77.00 62.73
VGTS(Ours) Yes 76.13 97.19 85.38
VML DSW No 20.70 46.83 28.71
CoAE Yes 27.30 53.07 36.05
OS2D Yes 55.31 52.45 53.84
VGTS(Ours) Yes 66.59 70.57 68.52
TKH Set 1 DSW No 41.70 83.71 55.67
CoAE Yes 58.81 85.92 69.83
OS2D Yes 89.31 95.72 92.40
VGTS(Ours) Yes 90.51 95.78 93.07
TKH Set 2 DSW No 36.22 82.03 50.25
CoAE Yes 54.67 81.74 65.52
OS2D Yes 85.53 95.96 90.45
VGTS(Ours) Yes 87.50 96.26 91.67
Table 3:
Performance evaluation for similar characters, “num” is the number of occurrences for each character.
(a) “[Uncaptioned image]” (Novel) and “[Uncaptioned image]” (Base).
Class AP50 # num.
[Uncaptioned image] 94.47 421
[Uncaptioned image] 95.04 11,017
(b) “[Uncaptioned image]” (Novel) and “[Uncaptioned image]” (Base).
Class AP50 # num.
[Uncaptioned image] 98.50 186
[Uncaptioned image] 100.0 3
(c) “[Uncaptioned image]” (Novel) and “[Uncaptioned image]” (Base).
Class AP50 # num.
[Uncaptioned image] 62.5 2
[Uncaptioned image] 99.41 237
(d) “[Uncaptioned image]” (Novel) and “[Uncaptioned image]” (Base).
Class AP50 # num.
[Uncaptioned image] 85.72 72
[Uncaptioned image] 93.33 2

During training, VGTS learns to match potential query image regions with the support character instance rather than accumulating class-specific knowledge. This learning strategy facilitates effective performance with limited training samples, crucial in the context of historical manuscript data featuring small, sparse datasets. By circumventing the need for an extensive training sample collection, the proposed method distinguishes itself from other one-shot methods that require larger sample volumes to achieve similar results.

Similar Characters Challenge. Misidentification can occur due to the structural similarity between characters, particularly when characters in ‘Novel’ categories bear a high resemblance to those in ‘Base’ categories. To assess this, we specifically selected ‘Base’ and ‘Novel’ categories, which consist of morphologically similar characters from the TKH dataset. As indicated in Table 3(d), VGTS maintains high-performance metrics.

Long-tailed Distribution Challenge. A challenge in text spotting for historical manuscripts is their long-tail distribution, with some characters appearing only once. Data from the TKH training set indicates that over 58% of characters appear fewer than five times. This poses significant difficulties in training models, especially in capturing features of low-frequency characters. Fig. 5 compares the effectiveness of various methods in addressing this long-tail distribution issue. OS2D outperforms CoAE in high-frequency categories, while VGTS, our category-agnostic method, demonstrates higher mAP across all frequency categories, particularly in low-frequency ones such as ‘[1,5)’ and ‘[5,10)’. This capability of VGTS is crucial for handling long-tail distributions, indicating its effectiveness in processing rare categories.

Fig. 5: Character frequency distribution in the TKH dataset and comparative mAP scores of different methods.

Rotational Disturbance. Our proposed method demonstrates exceptional robustness against minor rotational variations in test images, eliminating the need for rotation augmentations during training. We subjected test images to rotational disturbances ranging from 0° to 15° at 2.5° intervals. As depicted in Fig. 6, our method surpasses the comparative methods in managing rotated test images, experiencing only minimal performance degradation as the rotation angle increases. Conversely, the RPN-based model, CoAE, exhibits a rapid decline in performance when confronted with rotational disturbances. The OS2D model's performance degrades gradually as the query image's rotation angle increases. The unsupervised sliding-window-based method appears to be minimally impacted by rotational disturbances.

Fig. 6: Quantitative evaluation on DBH Novel with different angles of rotation on query images.

4.3.3 Qualitative Results and Analysis

Spotting Performance Under Varied Image Conditions. The support image quality substantially influences the localization and spotting performance. To demonstrate our method’s robustness under varying image quality conditions, we present visual results with low-quality support images in Fig. 7. Remarkably, our method consistently delivers outstanding performance, even when the support image is blurred, as illustrated in the first row of Fig. 7. Both OS2D and VGTS exhibit accurate character spotting when the support image is affected by illumination, as shown in the second row of Fig. 7.

Fig. 7: Example results on NC dataset. The leftmost column shows the ground truth, and the red box is the support image. All prediction boxes are shown in green.

We introduce Gaussian noise to the source images to simulate varying degrees of image degradation. The noise is added at variances corresponding to 15%, 25%, and 35% of the full intensity scale, respectively. A qualitative analysis is illustrated in Fig. 8, with erroneous detections highlighted by red dotted boxes. It is observed from Fig. 8 that the model maintains robust performance under low to moderate levels of Gaussian noise. However, when the noise level reaches 25%, the model begins to falter, resulting in instances of both missed and false spotting. We also provide a demo video (https://youtu.be/8GRDOxCDMjw, https://www.bilibili.com/video/BV12x4y1K7MX/), comprising several brief clips, that demonstrates the effectiveness of our proposed method.

Fig. 8: Exemplary results on the VML dataset show the visual impact of different levels of Gaussian noise.

In practical applications, character images with distinct glyph styles may be considered as support images for the spotting procedure. Quantitative analysis of this scenario is challenging; therefore, a qualitative assessment is provided in Fig. 9. The red box delineates the support image, while all predicted boxes are depicted in green. Within Fig. 9, it has been noted that using a support image with a different character style from the query image can produce favorable outcomes, thereby illustrating the model’s generalization capacity and transferability.

Fig. 9: Exemplary outcomes utilizing the TKH dataset are presented.

Spotting Novel Combination Characters. The proposed approach affords the flexibility to manage diverse and evolving application scenarios by enabling the spotting of novel classes. Specifically, we define a “word” as a combination of characters belonging to a novel class when a single character sample serves as the support image during training. The extensive annotation required to determine which characters can constitute a word renders quantitative experimental comparisons challenging. Consequently, we present qualitative experimental results that illustrate the proposed approach's effectiveness. Impressively, our method delivers precise spotting performance for combination characters. Fig. 10 showcases the performance of the proposed method on combination characters, emphasizing its capacity to accurately spot novel classes. It is essential to acknowledge that users must manually add unseen characters to the support image gallery for novel character classes, as our approach cannot automatically detect unknown categories in query images.

Fig. 10: Visualization of the VGTS quality in spotting novel combination characters.

Spotting All Characters. Figs. 11 and 12 offer a glimpse of VGTS's visualization results on the DBH and EGY datasets and show its adeptness in identifying symbols from ‘Novel’ categories. Categories not part of the training are marked with blue boxes, while those that have been trained are highlighted with green boxes. VGTS can swiftly assist historians in cataloging character categories, which greatly benefits the decipherment and study of historical manuscripts.

Fig. 11: Visualization results on the DBH dataset. Green boxes denote categories encountered during training (‘Base’), and blue boxes represent categories that are not present during the training process (‘Novel’).
Fig. 12: Visualization results on the EGY dataset. Green boxes denote categories encountered during training (‘Base’), and blue boxes represent categories that are not present during the training process (‘Novel’).

4.4 Ablation Study

4.4.1 Initialization of the feature extractor

When confronted with a limited amount of historical manuscript data, training the network from scratch proves to be a daunting task. As evidenced by the experimental results presented in Table 4, it becomes clear that the ResNet-50 backbone pre-trained on ImageNet (R50-ImgNet) delivers the best performance. Historical manuscripts in various languages may display unique patterns of texture features. We also tested the ResNet-50 pre-trained on the SynthText dataset [Gupta et al., 2016] (R50-Synth) as the backbone, which similarly achieved high performance. Conversely, utilizing the ResNet-101 architecture did not yield superior results. Based on experimental observations, we opted for R50-ImgNet as the backbone for our model. Additionally, if we freeze the backbone to prevent feature learning, the model still performs quite well, achieving a 90.81% mAP on DBH Novel, as seen in Table 5.

Table 4:
Ablation study of feature extractor and test on the novel categories of the DBH dataset. The best results are highlighted in bold.
Component Novel Base
mAP Recall F1 mAP Recall F1
R50-ImgNet 99.85 100.0 99.92 91.74 95.17 93.42
R50-Synth 99.47 98.64 99.05 88.58 93.42 90.94
R101-ImgNet 97.03 99.29 98.15 86.30 92.88 89.47
Table 5:
Ablation study of freezing backbone and test on the novel categories of the DBH dataset. The best results are highlighted in bold.
Strategy Novel Base
mAP Recall F1 mAP Recall F1
Unfreezing 99.85 100.0 99.92 91.74 95.17 93.42
Freezing 90.81 98.21 94.37 75.75 82.16 78.82
Table 6:
Performance comparison on different attention design methods of VGTS on the DBH dataset. The best results are highlighted in bold.
Configuration Novel Base
mAP Recall F1 mAP Recall F1
Support-First Att. 99.85 100.0 99.92 91.74 95.17 93.42
Query-First Att. 98.73 98.93 98.83 86.76 89.69 88.20
No Support Att. 99.21 100.0 99.60 87.89 89.77 88.82
No Query Att. 96.10 97.86 96.97 89.70 93.45 91.54
No Dual Att. 96.65 99.29 97.95 83.80 87.64 85.68
Fig. 13: The experiment of subsampling TKH categories during training, and testing on TKH Novel.
Table 7:
For testing the spotting loss function, we try several different settings of the loss function in the same architecture. We show the performance on the novel classes of the four datasets. The best results are highlighted in bold.
Loss Function DBH VML TKH Set 1 TKH Set 2 NC
mAP Recall F1 mAP Recall F1 mAP Recall F1 mAP Recall F1 mAP Recall F1
Triplet 79.25 86.79 82.85 60.00 86.86 70.97 71.28 97.75 82.44 62.19 95.32 75.27 88.83 95.41 92.00
Contrastive 92.11 97.14 94.56 98.23 98.73 98.48 75.79 99.05 85.87 69.43 97.48 81.10 90.60 94.50 92.51
RL 99.47 99.64 99.55 98.73 99.58 99.15 90.05 99.60 94.58 86.99 98.13 92.22 90.77 95.72 93.17
Torus (Ours) 99.85 100.0 99.92 100.0 100.0 100.0 90.20 99.65 94.69 88.06 99.52 93.44 92.14 96.79 94.41

4.4.2 Influence of Attention Design

To empirically validate the effectiveness of our proposed framework, we conduct a comprehensive ablation study on the dual spatial attention block. Specifically, we perform one of the following operations at a time: (1) interchanging the order of the two spatial attention operations, and (2) removing one of the spatial attention operations. We evaluate all the models on the DBH dataset, and the results are presented in Table 6. The human cognitive process typically involves examining the support image first and then seeking relevant similar regions in the query image. By altering the order of spatial attention in VGTS, with query spatial attention preceding support spatial attention, we observe a decrease in mAP scores by approximately 0.7% and 3% for the novel and base categories, respectively. The experimental results attest to the effectiveness of the proposed module sequence. Eliminating the spatial attention for support images resulted in a decrease of about 1.7% in mAP scores for the base category, and removing the spatial attention for query images led to approximately a 3.4% decrease in mAP scores for the novel category, providing evidence that the spatial attention for both support and query images plays a crucial role in achieving stable and significant improvements.

4.4.3 Subsampling Experiment

In contrast to widely-used object detection datasets such as COCO [Lin et al., 2014] and VOC [Everingham et al., 2010], historical manuscript images present unique challenges, including smaller sample sizes, a larger number of categories, and considerable similarities between categories. To assess the efficacy of our approach in low-resource scenarios, we conduct experiments by varying the number of base categories in the training set while maintaining the number of input images constant. Fig. 13 presents a performance comparison between our proposed method and OS2D on the TKH dataset, showcasing results for varying numbers of training set categories, ranging from 600 to 1392. As illustrated in Fig. 13, the number of categories in the training set significantly influences the performance outcomes. Specifically, when the number of participating categories is reduced by 792, there is a noticeable impact on the F1 scores. For OS2D, this reduction leads to a decrease of 4.83% on the TKH Set 1 and 4.32% on the TKH Set 2. In contrast, for our VGTS method, the F1 scores decline by 2.87% on the TKH Set 1 and 3.06% on the TKH Set 2.

To investigate the influence of the number of input images on the model's performance, we further evaluate the model on training sets comprising 40, 80, 120, and 171 images, corresponding to a total of 1104, 1232, 1320, and 1392 categories, respectively. The findings depicted in Fig. 13 indicate a positive correlation between the number of input images and model performance, as evidenced by higher F1 scores with larger image datasets. Notably, our method demonstrates robustness even in low-resource conditions, achieving F1 scores of over 92.2% and 89.78% on the TKH Set 1 and TKH Set 2 datasets, respectively, when trained with just 40 images.

4.4.4 Selection of Loss Function

The loss function comprises both localization and spotting components. For localization, we employ the smooth L_{1} loss, while for the spotting loss, we assess the effectiveness of our proposed new loss by testing several different configurations of the loss function within the same architecture. Our results in Table 7 demonstrate that the triplet loss function does not yield substantial performance improvements, whereas the contrastive loss function surpasses the triplet loss function. The ranked list loss, which can adaptively balance positive and negative samples, achieves commendable performance. However, the proposed torus loss function outperforms all other loss functions. This can be attributed to its ability to focus on challenging samples in the margin gap and balance positive and negative samples, which facilitates learning distance metrics and enhances text spotting in low-resource scenarios. The experimental results on all four datasets using the torus loss function corroborate its superiority compared to other loss functions, thus validating the effectiveness of our proposed approach.

Table 8:
Cross-domain performance measures of our method when training is done on different datasets and testing on DBH Novel.
Metric Training Dataset
DBH EGY VML TKH NC
mAP 99.85 97.79 98.44 92.90 58.23
Recall 100.0 99.64 99.29 98.57 61.79
F1 99.92 98.71 98.86 95.65 59.96

4.4.5 Cross-domain Performance

We have conducted a comprehensive evaluation of the cross-domain performance of our proposed model by training it on various datasets and selecting the DBH Novel dataset for testing purposes. The experimental results are presented in Table 8, demonstrating the effectiveness of our proposed model. For example, when the model is trained on the VML dataset, the DBH Novel dataset can be considered as a novel class. However, the model’s optimal performance is obtained when it is trained and tested on the same dataset, as the data distributions are similar. Nevertheless, cross-domain training may result in a decline in performance, particularly when the data types differ significantly, and a limited number of samples are used in the training process.

5 Conclusions

Existing object detection methods struggle to identify classes that are not part of the training, particularly in low-resource scenarios. To address this challenge, this paper presents a novel deep-learning approach for page-level text spotting in historical manuscripts, referred to as VGTS, that effectively tackles the low-resource open-set problem. The proposed method leverages dual spatial attention and correlation matching to align support and query images, closely emulating the human cognitive process. The approach enables the spotting of new classes without additional training and is particularly suitable for low-resource scenarios. To address example imbalance, we introduce a novel loss function called the torus loss, which makes the embedding space of the distance metric more discriminative. Our experimental results demonstrate that our approach significantly outperforms existing methods, thereby offering promising applications in this field.

Acknowledgments

This work was jointly supported by the National Natural Science Foundation of China under Grant No. 62176091 and the National Key Research and Development Program of China under Grant No. 2020AAA0107903.

Annex. The Challenges of Dongba Script Research.

The Dongba script, a unique hieroglyphic language developed by the Naxi minority’s ancestors in China, represents a vital part of cultural heritage and historical linguistics. Despite its importance, researchers in the field face several formidable challenges in deciphering and cataloging this ancient script. The challenges encountered in the study of Dongba script can be summarized as follows:

Fig. 14: Manuscript notes of a historian studying Dongba texts: newly discovered characters categorized and annotated.
  1. Resource Limitations and Manual Deciphering: Given the scarcity of resources, researchers often engage in manual identification and documentation of new symbols in Dongba manuscripts. This labor-intensive process demands high levels of accuracy and attention to detail, as illustrated in Fig. 14, which shows the painstaking process a scholar undergoes when documenting insights from Dongba manuscripts.

  2. Human Factor in Interpretation: The manual nature of this work introduces the risk of inefficiencies, potential misinterpretations, and inaccuracies. These human factors can significantly impact the reliability of the research findings.

  3. Discovery of Novel Characters: One of the most significant challenges in Dongba script research is the ongoing discovery of new characters and markings. This evolving nature of the script continually adds complexity to the research process.

  4. Technological Limitations: Traditional object detection methods, typically designed under a closed-set assumption, struggle to recognize these newly discovered characters. Adapting these methods requires constant model retraining, which can be resource-intensive and may detract from the primary research focus.

References