Deformable Kernel Expansion Model for Efficient Arbitrary-shaped Scene Text Detection
Abstract
Scene text detection is a challenging computer vision task due to the high variation in text shapes and aspect ratios. In this work, we propose a scene text detector named Deformable Kernel Expansion (DKE), which incorporates the merits of both segmentation-based and contour-based detectors. DKE employs a segmentation module to segment the shrunken text region as the text kernel, then expands the text kernel contour to the text boundary by regressing vertex-wise offsets. Generating the text kernel by segmentation enables DKE to inherit the arbitrary-shaped text region modeling capability of segmentation-based detectors. Regressing the kernel contour with a set of sampled vertices enables DKE to avoid complicated pixel-level post-processing and to better learn contour deformation, as contour-based detectors do. Moreover, we propose an Optimal Bipartite Graph Matching Loss (OBGML) that measures the matching error between the predicted contour and the ground truth, which efficiently minimizes the global contour matching distance. Extensive experiments on CTW1500, Total-Text, MSRA-TD500, and ICDAR2015 demonstrate that DKE achieves a good tradeoff between accuracy and efficiency in scene text detection.

1 Introduction
Scene text detection (STD) [18] has received increasing attention from academia and industry for its wide applications, such as intelligent transportation, cross-language information retrieval, visual search, and assistance for the blind. Precisely localizing the regions of text instances in natural images is essential for enabling or improving such applications. Despite the rapid development of text detection methods, detecting text instances in scene images accurately and efficiently remains a significant challenge.
The significant challenges in STD are handling the high variety of text shapes, scales, and the extreme aspect ratios of text regions. Segmentation-based detectors [28, 26, 30, 29, 11, 12] output pixel-wise predictions based on local texture, which has proven effective against the above challenges. Compared with other approaches (e.g., regression-based and contour-based methods), they possess great advantages in speed while maintaining similar accuracy. However, their local nature makes segmentation-based methods rely on complicated post-processing to obtain the text boundary. For example, TextSnake [19] proposes a striding algorithm that performs vertex-wise boundary detection along the central line of the response map. PSENet [28] generates multi-scale kernel representations of text regions and adopts a progressive BFS-based scale expansion algorithm to merge regions at the pixel level. Several recent segmentation-based STD models improve inference efficiency by simplifying the post-processing, such as the differentiable binarization model [12]. However, their accuracy heavily relies on the segmentation results; the lack of a contour adjustment mechanism leads to ineffective detection when the segmentation deviates substantially.
Motivated by contour-based methods [13, 22, 17, 39] in instance segmentation, another branch of STD models discretizes the text contour into a series of vertices and directly predicts the coordinates of those vertices. The adaptive text region representation model [31] uses a text region proposal network to obtain text region proposals; an RNN is then trained to predict the coordinates of boundary vertices as a sequence. The TextBPN model [38] learns an adaptive deformation model for adjusting text boundary vertices. Although achieving impressive boundary accuracy, this type of model usually needs to adjust vertices iteratively or sequentially, leading to an inefficient inference process.
In this work, we propose a novel STD model named Deformable Kernel Expansion (DKE) that leverages the merits of both segmentation-based and contour-based STD methods. Specifically, we extract dense contour vertices from the shrunken text kernel and the annotated text boundary for learning the Deformable Contour Expansion (DCE) module. This differs from most contour-based STD models, which refine coarse text boundaries into more accurate ones, and is more efficient because the text kernel, as a more centralized text region, carries less noise than a coarse text boundary. One technical issue of contour expansion is finding the vertex pairing relations between predicted vertices and annotated boundaries for calculating a contour deformation loss. Existing contour-based models usually pair them in a fixed manner, regardless of the continuous position adjustment of the predicted vertices. We propose a vertex pairing strategy that minimizes the deformation cost from a global perspective and present a novel contour deformation loss named the Optimal Bipartite Graph Matching Loss (OBGML). Benefiting from DKE and OBGML, our detector achieves superior or competitive results compared to others on curved and oriented text benchmarks. As shown in Fig.1, we achieve the best tradeoff between accuracy and efficiency compared with recent top-performing scene text detectors on arbitrary-shaped texts. The technical contributions of the proposed model can be summarized as follows:
-
We propose an efficient scene text detector named DKE. After the initial text segmentation and text kernel extraction, post-processing is performed by a learnable regression network, which expands the text kernel contour in a single step, making the generation of text boundaries fast and accurate.
-
We propose to use the contour of the text kernel instead of the text region to initialize the scene text contour deformation. Such initialization is less affected by noise while enabling the deformation to be accomplished in a single iteration.
-
We formulate the contour vertex pairing between the prediction and the ground truth as a bipartite graph matching problem and define a novel contour deformation loss named OBGML. It globally minimizes the total distance of deforming predicted contours to annotated boundaries and further improves the performance of DKE.
2 Related Work
Scene text detection methods have developed rapidly in recent years. In the era of deep learning, there exist two popular branches: segmentation-based methods and contour-based methods. They are widely studied for their ability to represent text instances with arbitrary shapes.
Segmentation-based Detector: Segmentation-based methods [19, 20, 9, 28, 26, 30, 11, 12] often predict text regions or basic components of scene texts, such as text kernels [28, 26, 30, 11] or text center regions [19], to group pixels into different text instances. Their most significant advantage is the flexible representation of text instances with arbitrary shapes. The pipeline of these methods originated from instance segmentation. Inheriting from the instance segmentation method Mask R-CNN [4], Mask TextSpotter [20] detected scene texts by instance segmentation, which enables it to detect arbitrary-shaped text. To effectively separate adjacent texts, PSENet [28] adopted a progressive scale algorithm to gradually expand pre-defined text kernels. PAN [30] replaced the post-processing of PSENet with a learnable module that predicts similarity vectors of pixels to achieve better performance. DB [11] simplified the post-processing by directly expanding the kernel contours to speed up detection. However, the lack of a contour adjustment mechanism makes DB [11] vulnerable to deviations in the segmentation results.

Contour-based Detector: Contour-based methods [31, 41, 33, 2, 38] treat scene text detection as a regression task. Unlike many regression-based methods [42, 10, 6, 15], these methods usually predict vertices on text boundaries directly. ContourNet [33] used a series of discrete vertices to represent text boundaries; it adopted a local orthogonal texture-aware module to model the local texture information of text proposals and then predicted vertices on text region boundaries. TextBPN [38] generated boundary proposals for text regions with a boundary proposal model and then adopted an adaptive boundary deformation model to refine the coarse boundaries iteratively. Although each iteration of contour refinement is quite efficient, contour-based methods often need to refine the coarse contour several times, which slows down the overall inference. Generally, contour-based methods lag behind segmentation-based methods in inference speed.
In real-world applications, a good scene text detector should be not only accurate but also efficient. Some works consider both aspects when designing scene text detectors. PAN [30] adopts a low-computational-cost segmentation head and learnable post-processing for scene text detection, which achieves a good balance between accuracy and efficiency. CT [23] reconstructs text boundaries using heuristics based on text kernels and centripetal shifts. This process can be computed in parallel with a single matrix operation, guaranteeing good efficiency without losing accuracy. Even though these methods are considerably fast in clustering text instances at the pixel level, the contour extraction procedure takes more than half of their inference time, which makes them unsuitable for applications that require text boundaries as outputs.
Our method combines the merits of segmentation-based and contour-based methods. It employs the segmentation-based methodology to generate text kernels as high-quality initial detection boundaries, while regressing the final detection contour via a contour deformation module. Therefore, our method inherits the strong text representation ability of the segmentation-based methodology and the per-iteration contour deformation efficiency of the contour-based methodology. In other words, our method achieves a good tradeoff between accuracy and efficiency.
3 Methodology
The Deformable Kernel Expansion (DKE) model consists of two core modules, namely Text Kernel Generation (TKG) and Deformable Contour Expansion (DCE). The TKG step aims at generating a high-quality text kernel to provide a good initial boundary, while the DCE step aims at learning to expand the kernel contours for obtaining the final detection.
3.1 Network Architecture
The entire architecture of our model is summarized in Fig.2. Following the conventions [28, 26, 30, 11], a ResNet [5] is employed to extract multi-scale features, and then these features are fused into the contextual feature by a feature-pyramid network [40]. This step can be mathematically denoted as $F = \Phi(I)$, where $\Phi$ is the hybrid mapping of the ResNet and the feature-pyramid network, and $I$ is an input image.
In the TKG step, we follow the pipeline of segmentation-based scene text detection methods and leverage a lightweight segmentation network $\Theta$ as an initial detection head to predict the probability map $P = \Theta(F)$, which encodes the text kernel at the pixel level. We uniformly sample $N$ vertices on the kernel boundary as the contour representation, denoted as $K = \{k_1, k_2, \dots, k_N\}$, which is used as the initial detection boundary. A contour composed of $N$ vertices is sufficient to describe most instances well [22]. Moreover, these vertices are ordered by certain rules, which will be introduced in Sec.3.2. By default, $K$ in this paper refers to the re-ranked vertex collection.
In the DCE step, we obtain the convolutional features for each vertex in $K$ by retrieving them from the feature map $F$ based on the vertex coordinates. Previous work [14] indicates that learning the coordinate mapping between Cartesian space and pixel space is challenging. Therefore, we follow some recent works [32, 16] and append the relative coordinates of vertices to the end of the convolutional features as extra information cues to alleviate this issue. In our method, the upper-left corner of the minimum bounding box is taken as the origin of the coordinate system for calculating the relative coordinates. The aforementioned hybrid feature processing is formulated as follows,

$$X = \Gamma(F, K), \qquad (1)$$

where $X = \{x_1, x_2, \dots, x_N\}$ are the obtained vertex features and $\Gamma$ is the mapping function of the aforementioned process.
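To make Eq. (1) concrete, below is a minimal sketch of this hybrid feature extraction in PyTorch (an assumed framework; the paper does not prescribe an implementation). The bilinear `grid_sample` lookup and the helper name `vertex_features` are our choices, and normalizing the relative coordinates by the bounding-box size is one plausible design:

```python
import torch
import torch.nn.functional as F


def vertex_features(fmap: torch.Tensor, verts: torch.Tensor) -> torch.Tensor:
    """Sketch of Gamma in Eq. (1).
    fmap: (1, C, H, W) contextual feature F; verts: (N, 2) xy pixel coords K.
    Returns (N, C + 2) vertex features X."""
    _, _, h, w = fmap.shape
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.empty_like(verts)
    grid[:, 0] = 2.0 * verts[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * verts[:, 1] / (h - 1) - 1.0
    grid = grid.view(1, 1, -1, 2)                           # (1, 1, N, 2)
    feats = F.grid_sample(fmap, grid, align_corners=True)   # (1, C, 1, N)
    feats = feats.squeeze(0).squeeze(1).t()                 # (N, C)
    # Relative coordinates w.r.t. the upper-left corner of the (axis-aligned)
    # minimum bounding box, scaled by the box size as extra cues [32, 16].
    origin = verts.min(dim=0).values
    size = verts.max(dim=0).values - origin + 1e-6
    rel = (verts - origin) / size                           # (N, 2)
    return torch.cat([feats, rel], dim=1)                   # (N, C + 2)
```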
The goal of the DCE step is to learn the mapping between these vertex representations and the offsets from the vertices on the kernel contour to the corresponding ones on the boundary of the final detection box, with a contour deformation network [22] $\Psi$,

$$\Delta = \Psi(X), \qquad (2)$$

where $\Delta = \{\delta_1, \delta_2, \dots, \delta_N\}$ are the learned coordinate offsets of $K$ toward the ground-truth boundary $G$. Then the predicted coordinates of the corresponding vertices on the final bounding box $B = \{b_1, b_2, \dots, b_N\}$ can be obtained as follows,

$$b_i = k_i + \delta_i, \quad i = 1, 2, \dots, N. \qquad (3)$$
We adopt a small text kernel because it makes separating two adjacent text regions easier and is less likely to be disturbed by the background. The contour deformation process is therefore actually a kernel expansion process.
3.2 Kernel Annotation and Vertex Sorting
In the training phase, we need to generate the text kernel and the contour ground truth for each training example based on its annotated text boundary. Following previous works [28, 11], we shrink the annotated text boundary into the text kernel via the Vatti clipping algorithm [27]. This algorithm calculates the shrinking margin between the shrunken kernel contour and the original text boundary based on a given shrink ratio. In our experiments, we empirically set the shrink ratio to 0.4 for generating the small text kernel.
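A sketch of this shrinking step is shown below, assuming the common `pyclipper` binding of the Vatti algorithm [27] and the margin rule $d = A(1 - r^2)/L$ used by DB-style detectors [11, 28] (polygon area $A$, perimeter $L$, shrink ratio $r$); the helper name `shrink_polygon` is hypothetical:

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon


def shrink_polygon(boundary: np.ndarray, ratio: float = 0.4) -> np.ndarray:
    """boundary: (M, 2) annotated text polygon; returns the shrunken kernel."""
    poly = Polygon(boundary)
    # Shrinking margin between the kernel contour and the original boundary.
    margin = poly.area * (1.0 - ratio ** 2) / poly.length
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(boundary.round().astype(np.int64).tolist(),
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    shrunk = pco.Execute(-margin)          # negative offset shrinks inward
    return np.array(shrunk[0]) if shrunk else boundary
```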

After that, we uniformly sample $N$ vertices on the generated text kernel boundary to depict the kernel contour. These sampled vertices are sorted in a clockwise manner, and the vertex closest to the upper-left corner of the minimum surrounding rectangle of the text kernel is taken as the first vertex.

Let $\mathcal{S}(\cdot)$ denote this vertex sampling and sorting operation. The kernel boundary can then be written as the sorted vertex collection mentioned above, $K = \mathcal{S}(\kappa)$, where $\kappa$ can be either the text kernel of a testing example generated by the segmentation network or the pre-processed text kernel (the ground truth) of a training example.
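One plausible NumPy realization of $\mathcal{S}(\cdot)$ is sketched below: vertices are resampled uniformly by arc length, the orientation is forced to clockwise, and the sequence is rolled so the first vertex lies nearest the upper-left corner. The default $N = 128$ and the axis-aligned bounding box are our assumptions:

```python
import numpy as np


def sample_and_sort(contour: np.ndarray, n: int = 128) -> np.ndarray:
    """contour: (M, 2) closed kernel polygon; returns (n, 2) sorted vertices."""
    pts = np.concatenate([contour, contour[:1]], axis=0)     # close the loop
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)       # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)   # uniform arc length
    idx = np.searchsorted(cum, targets, side='right') - 1
    t = (targets - cum[idx]) / (seg[idx] + 1e-8)
    samples = pts[idx] + t[:, None] * (pts[idx + 1] - pts[idx])
    # In image coordinates (y axis pointing down), a visually clockwise
    # contour has positive shoelace area; flip the order otherwise.
    area = 0.5 * np.sum(samples[:, 0] * np.roll(samples[:, 1], -1)
                        - np.roll(samples[:, 0], -1) * samples[:, 1])
    if area < 0:
        samples = samples[::-1]
    # Start from the vertex closest to the upper-left bounding-box corner.
    upper_left = samples.min(axis=0)
    start = np.argmin(np.linalg.norm(samples - upper_left, axis=1))
    return np.roll(samples, -start, axis=0)
```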
3.3 Model Optimization
The Deformable Kernel Expansion (DKE) model accomplishes accurate text detection from two aspects. On the one hand, DKE generates text kernels for an input image via a segmentation network; the kernels capture the shape characteristics of text regions. On the other hand, a regression network learns the contour deformation from text kernels to the final detection boundaries using only several sampled vertices. Accordingly, the DKE model simultaneously minimizes the text kernel generation loss and the deformable contour expansion loss to achieve both goals.
3.3.1 Text Kernel Generation
In this step, the feature $F$ of a scene text image is fed into a segmentation-based detection head to obtain the pixel classification probability map $P$, and then the binary cross-entropy is used to measure the discrepancy between the generated kernel and its ground truth. The loss of this process can be mathematically denoted as follows,

$$\mathcal{L}_{tkg} = -\sum_{i \in S}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big], \qquad (4)$$

where $p_i$ is the predicted probability of the $i$-th pixel and $y_i$ is its ground truth. Note that not all pixels take part in the loss calculation; we only measure the prediction discrepancies of pixels in a pixel subset $S$. Following previous works [11, 23], we apply this trick to alleviate the extreme imbalance between positive and negative pixels, where pixels in the text region are deemed positives and the remainder negatives. In $S$, all positives are preserved, while only some negatives are collected via a hard negative mining strategy that keeps the ratio of positives to negatives at 1:3.
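A minimal sketch of Eq. (4) with this mining strategy follows (PyTorch assumed; the function name `kernel_loss` is hypothetical). All positives are kept, and the hardest negatives are selected so that positives:negatives = 1:3:

```python
import torch
import torch.nn.functional as F


def kernel_loss(prob: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """prob, gt: (H, W) predicted kernel probability map and binary mask."""
    bce = F.binary_cross_entropy(prob, gt.float(), reduction='none')
    pos = gt > 0.5
    n_pos = int(pos.sum())
    n_neg = min(int((~pos).sum()), 3 * max(n_pos, 1))
    # Hard negative mining: keep only the highest-loss negatives.
    neg_losses, _ = bce[~pos].topk(n_neg)
    return (bce[pos].sum() + neg_losses.sum()) / (n_pos + n_neg + 1e-6)
```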

3.3.2 Deformable Contour Expansion
The Deformable Contour Expansion is achieved by predicting the offsets from vertices on the text kernel contour to the corresponding ones on the text boundary with a regression network [22], as depicted in Equation 2. Note that in the training phase, the text kernel generated from annotations is used for learning the contour deformation.
Most contour-based methods conduct vertex pairing with one of two strategies: direct matching [41, 22, 38] and nearest neighbor matching [39]. The first strategy does not perform any vertex alignment and directly applies a smooth $L_1$ loss to measure the discrepancy of the pre-fixed vertex sequences. Previous work [39] indicates that the absence of vertex alignment brings problems such as slow convergence and even wrong predictions, and suggests aligning the predicted vertex sequence with the ground truth by a nearest-matching principle. To distinguish the losses of these two strategies, we name them the Direct Matching Loss (DML) and the Nearest Neighbor Matching Loss (NNML), respectively.
NNML essentially employs a greedy rule to find the correspondence of each predicted vertex in the ground truth, which cannot guarantee the optimal vertex pairing for the kernel expansion process. Moreover, it cannot ensure that all vertices in the ground truth are used in the optimization, as shown in Figure 4. To address these issues, we cast the vertex pairing problem as a bipartite graph matching problem and propose the Optimal Bipartite Graph Matching Loss (OBGML) for optimizing the deformable contour expansion.
The goal of vertex pairing is to find, for each predicted vertex, a unique correspondence in the ground truth such that the total squared Euclidean distance between the predicted vertices and their correspondences is minimized. From the perspective of graph theory, the predicted vertices and their ground truths form two disjoint vertex sets, which can be depicted as a bipartite graph. Therefore, the aforementioned vertex pairing issue is actually a bipartite graph matching problem. It can be solved with the Hungarian algorithm [8], which obtains the globally optimal solution. An $N \times N$ vertex-vertex incidence matrix $D$ is constructed to encode such a bipartite graph. The $(i, j)$-th element of this matrix is the squared Euclidean distance between the $i$-th predicted vertex and the $j$-th real vertex,

$$D_{ij} = \lVert b_i - g_j \rVert_2^2. \qquad (5)$$

Then, the Hungarian algorithm [8] is employed to find the best bipartite graph matching based on this incidence matrix,

$$\hat{\sigma} = \mathrm{Hungarian}(D), \qquad (6)$$

where $\hat{\sigma}$ is the obtained index mapping between predicted vertices and their correspondences. Finally, the OBGML is formulated as follows,

$$\mathcal{L}_{dce} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{smooth}_{L_1}\big(b_i - g_{\hat{\sigma}(i)}\big). \qquad (7)$$
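The sketch below shows how Eqs. (5)-(7) can be realized, assuming SciPy's Hungarian solver `scipy.optimize.linear_sum_assignment` and a smooth-$L_1$ penalty on the matched pairs; the function name `obgml` is hypothetical:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def obgml(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (N, 2) predicted and ground-truth contour vertices (B and G)."""
    # Eq. (5): N x N incidence matrix of squared Euclidean distances.
    dist = torch.cdist(pred, gt, p=2) ** 2
    # Eq. (6): globally optimal pairing; the solver runs outside autograd.
    rows, cols = linear_sum_assignment(dist.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    # Eq. (7): deformation loss over the optimally matched vertex pairs.
    return F.smooth_l1_loss(pred[rows], gt[cols])
```

During training, this term is weighted and added to the kernel segmentation loss, as formalized next.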
In such a manner, the DKE model is obtained by minimizing the aforementioned two losses jointly,

$$\mathcal{L} = \mathcal{L}_{tkg} + \lambda \mathcal{L}_{dce}, \qquad (8)$$

where $\lambda$ is a tunable hyper-parameter for reconciling the two losses. According to the numeric values of the losses, $\lambda$ is set to 0.25 in all of our experiments.
3.4 Inference
In the inference phase, we use the learned networks $\Phi$, $\Theta$, and $\Psi$ to extract the features, generate the text kernels, and predict the contour deformation offsets for an image $I$ as follows,

$$F = \Phi(I), \quad K = \mathcal{S}(\Theta(F)), \quad \Delta = \Psi(\Gamma(F, K)). \qquad (9)$$

Then, the final detection boundary depicted by $N$ vertices is produced as follows,

$$B = K + \Delta. \qquad (10)$$
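A hypothetical end-to-end composition of Eqs. (9)-(10) is sketched below, reusing the `sample_and_sort` and `vertex_features` helpers from the earlier sketches; `backbone`, `seg_head`, and `deform_net` stand in for $\Phi$, $\Theta$, and $\Psi$, and the binarization threshold 0.3 is an assumption. It also assumes the probability map and the feature map share one resolution:

```python
import cv2
import numpy as np
import torch


def detect(image: torch.Tensor, backbone, seg_head, deform_net, thresh=0.3):
    """image: (1, 3, H, W) preprocessed tensor; returns a list of (N, 2) boxes."""
    with torch.no_grad():
        fmap = backbone(image)                     # F = Phi(I)
        prob = seg_head(fmap)[0, 0]                # kernel probability map P
        mask = (prob > thresh).byte().cpu().numpy()
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = []
        for c in contours:                         # one text kernel per contour
            k = sample_and_sort(c[:, 0, :].astype(np.float32))  # K = S(P)
            k = torch.from_numpy(np.ascontiguousarray(k))
            x = vertex_features(fmap, k)           # X = Gamma(F, K), Eq. (1)
            delta = deform_net(x)                  # Delta = Psi(X), Eq. (2)
            boxes.append((k + delta).cpu().numpy())  # B = K + Delta, Eq. (10)
        return boxes
```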
4 Experiments
4.1 Datasets
SynthText [3] is a synthetic dataset consisting of more than 800K synthetic images. It is only used to pre-train our model.
MLT-2017[21] is a multi-lingual dataset that contains 7200 training images, 1800 validation images, and 9000 test images. Those images are annotated with quadrilateral boxes at the word level. The training and validation images are used to pre-train our model.
ICDAR2015 [7] consists of 1000 training images and 500 testing images captured by Google Glass. Most of the included text instances are severely distorted or blurred. The text instances are labeled at the word level with quadrilateral boxes.
MSRA-TD500 [35] dataset is a multi-language dataset that includes English and Chinese text instances. There are 300 training images and 200 testing images. The text instances are all labeled at the text-line level.
Total-Text [1] is a curved text dataset including 1255 training and 300 testing images. It contains horizontal, multi-oriented, and curve text instances labeled at the word level.
CTW1500 [36] is another curved text dataset, including 1000 training images and 500 testing images. It contains both English and Chinese texts annotated at the text-line level with polygons.

4.2 Implementation details
FPN [40] with ResNet [5] is used as our backbone. Following previous works [28, 30, 37, 29], we first pre-train our models on external text datasets (e.g., SynthText and MLT-2017). Following a poly learning rate policy [40] with the power set to 0.9, we then fine-tune the pre-trained models on the related real-world datasets. All models are optimized by the Adam optimizer with a batch size of 16 on two RTX 3090 GPUs.
To make fair comparisons of inference speed between different methods, we set up a platform to test several top-performing detectors with the same strategy. Our system consists of one RTX 3090 GPU and one Ryzen 5700X CPU. Considering that model performance affects the time consumption of post-processing, we use the officially provided weights instead of training the models ourselves. All tests are processed in a single thread with a batch size of 1. More details are provided in the Appendix.
4.3 Ablation study
Discussion of Contour Deformation Strategy: Recent segmentation-based text detection methods, such as [30, 23, 11, 12], share a similar pipeline for locating the text kernel. DB [11] leverages the Vatti clipping algorithm [27] as a simple post-processing method to generate the final detection contour based on segmentation results. The main merit of this method is efficiency, but it relies greatly on the accuracy of the segmentation results and neglects adaptively adjusting the detected contours. We conduct several experiments under this setting to compare with our proposed contour deformation module DCE, which employs a network to learn the deformation from the kernel boundary to the final detection boundary. According to the results in Tab.1, DCE obtains 2.9% and 1.5% F-measure gains over the baseline on Total-Text and CTW1500, respectively, with only a marginal drop in FPS. These results imply that it is worthwhile to optimize the contour deformation jointly with the kernel segmentation, and that a learnable kernel expansion module outputs more accurate and robust detection boundaries than expanding the kernel in a fixed manner. Moreover, our proposed Optimal Bipartite Graph Matching Loss contributes additional 0.9% and 1.0% F-measure gains on Total-Text and CTW1500, respectively, while maintaining the same detection efficiency. This validates the importance of the contour deformation loss in optimization; the contour deformation losses are discussed in detail in the next section.
| Datasets | Method | P | R | F | FPS |
|---|---|---|---|---|---|
| Total-Text | Baseline | 89.3 | 79.5 | 84.1 | 46 |
| Total-Text | +DCE | 89.9 | 84.5 | 87.0 | 41 |
| Total-Text | +DCE* | 90.8 | 85.1 | 87.9 | 41 |
| CTW1500 | Baseline | 84.0 | 83.0 | 83.5 | 47 |
| CTW1500 | +DCE | 87.4 | 82.6 | 84.9 | 41 |
| CTW1500 | +DCE* | 86.6 | 85.2 | 85.9 | 41 |


Discussion of Contour Deformation Loss: We apply three contour deformation losses to our Deformable Kernel Expansion (DKE) model: the Direct Matching Loss (DML), the Nearest Neighbor Matching Loss (NNML), and the Optimal Bipartite Graph Matching Loss (OBGML). DML does not conduct any further vertex pairing, while the latter two apply different vertex pairing strategies before calculating the loss. NNML employs a nearest neighbor search to find the corresponding ground-truth vertex for each predicted vertex, whereas our proposed OBGML pairs the vertex sequences by casting vertex pairing as a bipartite graph matching problem, which guarantees the global optimum. Fig.6 compares the F-measures under the same settings. The results show that OBGML outperforms the other two losses across several settings and datasets, which we attribute to its superior bipartite graph matching-based vertex pairing strategy. Surprisingly, NNML performs much worse than the simplest and most common loss, DML, when applied to kernel expansion. We attribute this failure to the large gap between our practical scenario and the one NNML was designed for: during expansion, predicted vertices are optimized toward the longer side of the text region, which causes the loss of key ground-truth vertices and of complete shape information, as shown in Fig.4.
| Datasets | Iter | P | R | F | FPS |
|---|---|---|---|---|---|
| Total-Text | Iter. 1 | 90.8 | 85.1 | 87.9 | 41 |
| Total-Text | Iter. 2 | 90.2 | 84.6 | 87.3 | 36 |
| CTW1500 | Iter. 1 | 86.6 | 85.2 | 85.9 | 41 |
| CTW1500 | Iter. 2 | 86.4 | 84.1 | 85.2 | 37 |
| Method | Venue | Backbone | TD500 P | TD500 R | TD500 F | CTW1500 P | CTW1500 R | CTW1500 F | Total-Text P | Total-Text R | Total-Text F | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PAN [30] | ICCV’2019 | R18 | 84.4 | 83.8 | 84.1 | 86.4 | 81.2 | 83.7 | 89.3 | 81.0 | 85.0 | - |
| CT [23] | NeurIPS’2021 | R18 | 90.0 | 82.5 | 86.1 | 88.3 | 79.9 | 83.9 | 90.5 | 82.5 | 86.3 | 47.3 |
| DBNet [11] | AAAI’2020 | R18† | 90.4 | 76.3 | 82.8 | 84.8 | 77.5 | 81.0 | 88.3 | 77.9 | 82.8 | 81.1 |
| DBNet++ [12] | TPAMI’2022 | R18† | 87.9 | 82.5 | 85.1 | 86.7 | 81.3 | 83.9 | 87.4 | 79.6 | 83.3 | 75.1 |
| DKE (Ours) | - | R18 | 87.9 | 83.1 | 85.4 | 86.9 | 82.2 | 84.5 | 88.2 | 83.7 | 85.9 | 67.2 |
| SAE [26] | CVPR’2019 | R50 | 84.2 | 81.7 | 82.9 | 82.7 | 77.8 | 80.1 | - | - | - | - |
| PSENet-1s [28] | CVPR’2019 | R50 | - | - | - | 82.5 | 79.9 | 81.2 | - | - | - | - |
| SPCNet [34] | AAAI’2019 | R50 | - | - | - | - | - | - | 83.0 | 82.8 | 82.9 | - |
| ContourNet [33] | CVPR’2020 | R50 | - | - | - | 83.7 | 84.1 | 83.9 | 86.9 | 83.9 | 85.4 | - |
| DBNet [11] | AAAI’2020 | R50† | 91.5 | 79.2 | 84.9 | 86.9 | 80.2 | 83.4 | 87.1 | 82.5 | 84.7 | 42.2 |
| FCENet [44] | CVPR’2021 | R50† | - | - | - | 87.6 | 83.4 | 85.5 | 89.3 | 82.5 | 85.8 | - |
| TextBPN [38] | ICCV’2021 | R50 | 86.6 | 84.5 | 85.6 | 86.5 | 83.6 | 85.0 | 90.7 | 85.2 | 87.9 | 17.2 |
| DBNet++ [12] | TPAMI’2022 | R50† | 91.5 | 83.3 | 87.2 | 87.9 | 82.8 | 85.3 | 88.9 | 83.2 | 86.0 | 40.8 |
| FewNet [25] | CVPR’2022 | R50 | 91.6 | 84.8 | 88.1 | 88.1 | 82.4 | 85.2 | 90.7 | 85.7 | 88.1 | - |
| DKE (Ours) | - | R50 | 91.7 | 85.9 | 88.7 | 86.6 | 85.2 | 85.9 | 90.8 | 85.1 | 87.9 | 41.0 |
Iterations for Contour Deformation: Traditional contour-based scene text detection methods [41, 38, 2] often need to iteratively adjust coarse predicted boundaries several times, and their initial inputs usually target the text region boundary. In contrast, our DCE module models the contour deformation as a kernel expansion. The local features on text boundaries are less reliable than those on the kernel boundary, which lie in the more central area of the text region. The reliable features speed up optimization, so only a few iterations of contour adjustment are needed for good convergence. The observations in Tab.2, which reports the performance of our model over the first two iterations on Total-Text and CTW1500, validate this well: our model obtains outstanding performance in the first iteration, while more iterations degrade it.
| Method | Venue | P | R | F |
|---|---|---|---|---|
| SegLink [24] | CVPR’2017 | 73.1 | 76.8 | 75.0 |
| TextSnake [19] | ECCV’2018 | 84.9 | 80.4 | 82.6 |
| SAE [26] | CVPR’2019 | 88.3 | 85.0 | 86.6 |
| PSENet-1s [28] | CVPR’2019 | 88.7 | 85.5 | 87.1 |
| PAN [30] | ICCV’2019 | 84.0 | 81.9 | 82.9 |
| SPCNet [34] | AAAI’2019 | 88.7 | 85.8 | 87.2 |
| ContourNet [33] | CVPR’2020 | 87.6 | 86.1 | 86.9 |
| DBNet [11] | AAAI’2020 | 91.8 | 83.2 | 87.3 |
| DRRG [37] | CVPR’2020 | 88.5 | 84.6 | 86.5 |
| FCENet [44] | CVPR’2021 | 90.1 | 82.6 | 86.2 |
| DBNet++ [12] | TPAMI’2022 | 90.9 | 83.9 | 87.3 |
| FewNet [25] | CVPR’2022 | 90.9 | 87.3 | 89.1 |
| DKE (Ours) | - | 91.0 | 84.3 | 87.5 |
4.4 Comparisons with Previous Methods
Curved text detection. Tab.3 reports the scene text detection performance on two curved text datasets, Total-Text and CTW1500. Our approach achieves the best and the second-best F-measures on CTW1500 and Total-Text, respectively. For example, compared with DBNet++, one of the most influential and efficient segmentation-based approaches, our method enjoys almost the same efficiency while outperforming it by 2.6% and 1.9% in F-measure on Total-Text with ResNet18 and ResNet50 backbones, respectively; the gain on CTW1500 is 0.6%. CT is also a segmentation-based approach and performs very well on both datasets with a ResNet18 backbone; however, our method achieves similar performance with only 70% of its inference time. TextBPN is the best-performing contour-based approach among all baselines; our method achieves similar or even better F-measure while enjoying 2.4 times faster inference. Clearly, our method offers a good tradeoff between detection performance and inference speed. Fig.5 visualizes detection results of several top-performing detectors on the Total-Text dataset. These visualizations also confirm that our method provides more accurate detection results and better separates adjacent text regions.
Quadrangular text detection. According to the results in Tab.3 and Tab.4, similar phenomena can be observed on the two quadrangular text datasets, MSRA-TD500 and ICDAR2015. Our method still achieves competitive performance on these two benchmarks. For example, it performs best in F-measure on MSRA-TD500 with a ResNet50 backbone, surpassing FewNet [25], the recent state-of-the-art detector, by 0.6% on MSRA-TD500.
5 Conclusion
In this paper, we present an effective arbitrary-shaped scene text detector named Deformable Kernel Expansion (DKE), which inherits the merits of both segmentation-based and contour-based methods. DKE takes advantage of the arbitrary-shaped text region modeling power of segmentation-based detectors at the pixel level by segmenting text kernels as the initial boundaries; these kernels can be deemed shrunken versions of the text regions. Based on the text kernel, DKE learns to expand the kernel contour rather than adjusting the text boundary several times as conventional contour-based scene text detectors do. Such an expansion can be optimized in only one iteration, which greatly speeds up inference and also avoids the complicated pixel-level post-processing adopted by most segmentation-based detectors. Extensive experimental results on several benchmarks validate the effectiveness of DKE and verify that our method achieves a good tradeoff between accuracy and efficiency.
References
- [1] Chee Kheng Ch’ng and Chee Seng Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
- [2] Pengwen Dai, Sanyi Zhang, Hua Zhang, and Xiaochun Cao. Progressive contour regression for arbitrary-shape scene text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7393–7402, 2021.
- [3] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2315–2324, 2016.
- [4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
- [6] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Deep direct regression for multi-oriented scene text detection. In Proceedings of the IEEE international conference on computer vision, pages 745–753, 2017.
- [7] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. In 2015 13th international conference on document analysis and recognition (ICDAR), pages 1156–1160. IEEE, 2015.
- [8] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
- [9] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. Mask textspotter v3: Segmentation proposal network for robust scene text spotting. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
- [10] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In Thirty-first AAAI conference on artificial intelligence, 2017.
- [11] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11474–11481, 2020.
- [12] Minghui Liao, Zhisheng Zou, Zhaoyi Wan, Cong Yao, and Xiang Bai. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [13] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5257–5266, 2019.
- [14] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. Advances in neural information processing systems, 31, 2018.
- [15] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. Abcnet: Real-time scene text spotting with adaptive bezier-curve network. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9809–9818, 2020.
- [16] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.
- [17] Zichen Liu, Jun Hao Liew, Xiangyu Chen, and Jiashi Feng. Dance: A deep attentive contour model for efficient instance segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 345–354, 2021.
- [18] Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era. International Journal of Computer Vision, 129(1):161–184, 2021.
- [19] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European conference on computer vision (ECCV), pages 20–36, 2018.
- [20] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–83, 2018.
- [21] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1454–1459. IEEE, 2017.
- [22] Sida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, and Xiaowei Zhou. Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8533–8542, 2020.
- [23] Tao Sheng, Jie Chen, and Zhouhui Lian. Centripetaltext: An efficient text instance representation for scene text detection. Advances in Neural Information Processing Systems, 34:335–346, 2021.
- [24] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2550–2558, 2017.
- [25] Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022.
- [26] Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4234–4243, 2019.
- [27] Bala R Vatti. A generic solution to polygon clipping. Communications of the ACM, 35(7):56–63, 1992.
- [28] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9336–9345, 2019.
- [29] Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Zhibo Yang, Tong Lu, and Chunhua Shen. Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5349–5367, 2021.
- [30] Wenhai Wang, Enze Xie, Xiaoge Song, Yuhang Zang, Wenjia Wang, Tong Lu, Gang Yu, and Chunhua Shen. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8440–8449, 2019.
- [31] Xiaobing Wang, Yingying Jiang, Zhenbo Luo, Cheng-Lin Liu, Hyunsoo Choi, and Sungjin Kim. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6449–6458, 2019.
- [32] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. Solov2: Dynamic and fast instance segmentation. Advances in Neural information processing systems, 33:17721–17732, 2020.
- [33] Yuxin Wang, Hongtao Xie, Zheng-Jun Zha, Mengting Xing, Zilong Fu, and Yongdong Zhang. Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11753–11762, 2020.
- [34] Enze Xie, Yuhang Zang, Shuai Shao, Gang Yu, Cong Yao, and Guangyao Li. Scene text detection with supervised pyramid context network. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 9038–9045, 2019.
- [35] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE conference on computer vision and pattern recognition, pages 1083–1090. IEEE, 2012.
- [36] Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. Detecting curve text in the wild: New dataset and new solution. arXiv: Computer Vision and Pattern Recognition, 2017.
- [37] Shi-Xue Zhang, Xiaobin Zhu, Jie-Bo Hou, Chang Liu, Chun Yang, Hongfa Wang, and Xu-Cheng Yin. Deep relational reasoning graph network for arbitrary shape text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9699–9708, 2020.
- [38] Shi-Xue Zhang, Xiaobin Zhu, Chun Yang, Hongfa Wang, and Xu-Cheng Yin. Adaptive boundary proposal network for arbitrary shape text detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1305–1314, 2021.
- [39] Tao Zhang, Shiqing Wei, and Shunping Ji. E2ec: An end-to-end contour-based method for high-quality high-speed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4443–4452, 2022.
- [40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
- [41] Mengbiao Zhao, Wei Feng, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. Weakly-supervised arbitrary-shaped text detection with expectation-maximization algorithm. arXiv preprint arXiv:2012.00424, 2020.
- [42] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5551–5560, 2017.
- [43] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9308–9316, 2019.
- [44] Yiqin Zhu, Jianyong Chen, Lingyu Liang, Zhanghui Kuang, Lianwen Jin, and Wayne Zhang. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3123–3131, 2021.