Multi-Cell Decoder and Mutual Learning for Table Structure and Character Recognition
Abstract
Extracting table contents from documents such as scientific papers and financial reports and converting them into a format that can be processed by large language models is an important task in knowledge information processing. End-to-end approaches, which recognize not only the table structure but also the cell contents, have achieved performance comparable to state-of-the-art models that rely on external character recognition systems, and they have potential for further improvement. In addition, these models can now recognize long tables with hundreds of cells by introducing local attention. However, they recognize the table structure in one direction from the header to the footer, and cell content recognition is performed independently for each cell, so there is no opportunity to retrieve useful information from neighboring cells. In this paper, we propose a multi-cell content decoder and a bidirectional mutual learning mechanism to improve the end-to-end approach. The effectiveness is demonstrated on two large datasets, and the experimental results show performance comparable to state-of-the-art models, even for long tables with large numbers of cells.
Keywords: Deep Learning, Table Recognition, Transformer, Mutual Learning

1 Introduction
Information retrieval technology that provides high-quality knowledge to large language models (LLMs) is attracting attention. Many researchers have worked on converting scanned and imaged documents into machine-readable formats such as HTML code [37, 42, 15] and LaTeX code [3, 12]. This effort has direct and indirect benefits. First, because past literature remains mostly in printed form, it must be converted into structured electronic documents; this is a direct benefit. Second, realizing intelligence that recognizes the meaning implicit in the layout of documents intended to be read by humans is important from the perspective of human-machine interaction; this is an indirect benefit.
In this paper, we focus on table recognition, which comprises two tasks, namely structure recognition and cell content recognition. A simple table has horizontal and vertical borders, and each cell contains characters. Complex tables may contain cells that are merged vertically or horizontally, and/or cells with invisible borders. Humans can understand the table structure from the cell layout even without explicit borders, but this remains a challenging problem for machines.
In recent years, following the success of Transformer [31] models in language and visual recognition tasks, many Transformer-based methods [21, 36, 18, 19] have been proposed for the table recognition task. Since an external optical character recognition (OCR) system can be used to parse the cell contents, these methods can focus mainly on table structure recognition. They exploit cross attention between image features and embedded representations of HTML tokens to predict the HTML tokens sequentially. In previous studies [21, 36, 18, 19, 20], token prediction was performed in one direction from the header to the footer and from left to right. This forgoes the opportunity to attend to the table structure that lies ahead.
Ly and Takasu [19] reported that end-to-end learning of the table structure and cell content recognition tasks may improve overall table recognition performance. In addition, tables may contain several hundred cells, and sequential prediction approaches may suffer from poor performance on them, which can be mitigated by introducing local attention [18]. These previous studies performed cell content recognition independently for each detected cell after structure recognition. This removes the opportunity to obtain useful information from neighboring cells.
As a solution to these problems, we improve the end-to-end approach [18, 19, 20] by proposing a method that refers to the recognition results of neighboring cells and a learning mechanism that focuses on both previous and following cells. The former is achieved by introducing a cell decoder that infers multiple cells, configured as a hierarchical decoder together with an HTML decoder for structure recognition. The latter is achieved by mutual learning [38] between a forward decoder, which reads the table structure from left to right, and a backward decoder, which reads it from right to left. The effectiveness of the proposed method is demonstrated using two large-scale table image datasets.
The main contributions of this paper are: 1) we propose a cell decoder that infers multiple cells and obtains useful information from surrounding cells; 2) we propose a bidirectional mutual learning mechanism that forces the proposed model to pay attention to both previous and following cells; 3) the experimental results show that the proposed method achieves performance comparable to or better than state-of-the-art models.
2 Related Work
In general, the table recognition task is performed as two subtasks, namely table structure recognition and cell content recognition. In principle, the final output is a single HTML [37, 42, 15] or LaTeX [3, 12] document, so the two subtasks need not be distinguished. In practice, however, it is better to recognize tags (or commands) and the other visible characters with separate models. For the cell content recognition task, existing highly accurate OCR systems [17] are available, and thus previous studies [11, 13, 32, 26, 28, 30] have mainly focused on table structure recognition.
Table structure recognition has been studied for a long time, and approaches based on hand-crafted features and heuristic rules [11, 13, 32] were proposed, but their application was limited to simple tables or tables with predefined patterns. With the development of deep learning, methods that automatically learn table structural patterns [39, 27, 26, 28, 30] have become mainstream. These studies can be divided into approaches based on object detection and segmentation [30, 27], and approaches based on sequential token prediction [21, 36].
For the detection and segmentation approaches, Schreiber et al. [30] proposed a two-fold system using Faster R-CNN [29] and fully convolutional networks [16] for both table detection and table structure recognition. Raja et al. [28] proposed a two-stage model that estimates the relationships between cells after recognizing their locations. Qiao et al. [27] won first place in the ICDAR competition [37] by combining text, cell, row, and column recognition tasks using Mask R-CNN [5].
For the sequential token prediction approaches, a simple image caption model can be utilized for cell detection because the order of cells is uniquely determined. Ye et al. [36] and Nassar et al. [21] proposed Transformer models with two types of decoders for table structure recognition and cell localization. Peng et al. [25] achieved performance comparable to a model using a deep convolutional encoder while significantly reducing parameters by introducing a convolutional stem.
In the 2020s, researchers have been investigating end-to-end models that learn both table structure and cell content recognition tasks [3]. Zhong et al. [42] proposed a model that uses a ResNet [6] encoder and two LSTM [8] decoders to recognize both table structure and cell contents, but its performance was inferior to models using external OCR.
Ly and Takasu [19] proposed a multi-task model that detects table structure, cell locations, and cell contents. Their model uses a ResNet encoder with global context attention [2] and two Transformer decoders. The first decoder infers the HTML tokens sequentially, and then the second decoder reads the cell contents one by one. This model achieved performance comparable to the models utilizing external OCR. They also proposed weakly supervised learning to reduce the cost of preparing bounding box training data [20] and introduced local attention [1] to effectively recognize tables with a large number of cells [18].
In 2021, the scientific literature parsing competition [37] was held at ICDAR 2021. The competition consisted of document layout recognition task A and table recognition task B. Task B required converting table images to HTML tags with cell contents. The PubTabNet [42] dataset and the final evaluation dataset were provided for this task. The training dataset consists of HTML tokens, cell texts, and cell bounding boxes. There were 30 submissions from 30 teams and most of the top 10 solutions exploited separate OCR models, additional annotation and ensemble techniques.
TabRecSet [35] is a bilingual dataset containing rotated and distorted tables in real photographs for three tasks, namely table detection, structure recognition, and cell content recognition. Detection of such tables is outside the scope of this paper.
3 Background
Similar to previous work [18, 19, 20], our proposed model uses a ResNet encoder and an HTML decoder consisting of multiple attention blocks [31] to infer HTML tokens representing table structure. An additional decoder is exploited to infer cell contents. The encoder and two decoders are trained simultaneously using an end-to-end approach. In this section, we introduce some techniques used by the proposed method described in Section 4.
3.1 Encoder
Previous studies [21, 36, 18] used a convolutional neural network (CNN) to extract image features and fed them to the decoder. A CNN is useful for recognizing small characters while preserving locality, such as character positions, and it reduces the size of the image features, improving the computational efficiency and performance of the decoder.
The number of convolutional layers contributes to recognition performance, and many derivatives have been explored to increase it. ResNet [6], which consists of a large number of residual blocks of multiple convolutional layers with simple skip connections, has been commonly used. In addition, ResNeXt [34] with grouped convolutions and DenseNet [9] with more complicated skip connections between all convolutional layers were proposed.
One weakness of CNNs is that, by focusing strongly on local features, they have a poor ability to recognize global context. As a solution, the global context attention (GCA) block [2] was proposed, defined by Eq. (1).
$y_i = x_i + W_{v_2}\,\mathrm{ReLU}\!\left(\mathrm{LN}\!\left(W_{v_1}\sum_j \mathrm{softmax}(W_k x_j)\,x_j\right)\right)$ (1)

where $i, j$ are the pixel indices, $x_i, y_i$ are the input and output pixels, respectively, $W_k, W_{v_1}, W_{v_2}$ are the weight matrices of three linear layers, and $\mathrm{LN}$ means layer normalization. The softmax function is defined as follows.

$\mathrm{softmax}(W_k x_j) = \dfrac{\exp(W_k x_j)}{\sum_m \exp(W_k x_m)}$ (2)

where $j, m$ are the pixel indices. The GCA block should be placed between some residual (or dense) blocks.
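As an illustration, a global context attention block along the lines of Eqs. (1) and (2) can be sketched in PyTorch as follows. The use of 1×1 convolutions for the three linear layers and the bottleneck ratio are our illustrative choices, not necessarily the exact configuration used in the encoder of Section 4.1.

```python
import torch
import torch.nn as nn

class GlobalContextAttention(nn.Module):
    """Minimal global context attention (GCA) block in the spirit of Eqs. (1)-(2)."""

    def __init__(self, channels: int, bottleneck_ratio: float = 0.25):
        super().__init__()
        hidden = int(channels * bottleneck_ratio)
        self.context_key = nn.Conv2d(channels, 1, kernel_size=1)   # W_k
        self.transform = nn.Sequential(                             # W_v1, LN, ReLU, W_v2
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Softmax over all pixels of W_k x_j, as in Eq. (2).
        weights = self.context_key(x).view(b, 1, h * w).softmax(dim=-1)    # (B, 1, HW)
        # Global context vector: attention-weighted sum of the input pixels.
        context = torch.bmm(weights, x.view(b, c, h * w).transpose(1, 2))  # (B, 1, C)
        context = context.transpose(1, 2).view(b, c, 1, 1)
        # Transform the context and add it to every pixel via broadcasting.
        return x + self.transform(context)
```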
3.2 Decoder
Transformer [31] achieves superior performance in both language modeling and visual recognition tasks. Compared to recurrent neural networks, including long short-term memory [8], a Transformer itself does not involve recursion, allowing sequential input and output data to be processed in parallel during training. It should be noted that recurrent, sequential inference is still performed at test time unless the prediction length is fixed. However, the Transformer does not require recursion to recognize the context of a sequence, avoiding vanishing gradients and providing better performance.
The key idea of Transformer is called scaled dot product attention. Let $X$ be a sequence of length $L$ with $C$ channels, and $Y$ be another input sequence. For self attention, $X$ and $Y$ are the same sequence, and the Transformer pays attention to other parts of $X$ in processing $X$. For cross attention, $X$ and $Y$ are in different domains, and the Transformer pays attention to some parts of $Y$ in processing $X$. These mechanisms allow Transformer to learn the context of sequential data and the relationship between visual and language domains.

The attention layer first generates a query $q_i$, key $k_j$, and value $v_j$ from $X$ and $Y$ as defined in Eq. (3).

$q_i = W_Q x_i, \quad k_j = W_K y_j, \quad v_j = W_V y_j$ (3)

where $i$ and $j$ are the sequence indices, $x_i$ and $y_j$ are the elements of $X$ and $Y$, respectively, and $W_Q$, $W_K$, and $W_V$ are the projection matrices.

The output of the attention layer is defined by Eq. (4).

$Z = W_O\,\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$ (4)

where $W_O$ is the output projection matrix, and $d$ is the dimension of $k_j$. This is the mechanism of the scaled dot product attention [31].
In practice, the attention layer is divided into several groups (heads), each of which pays attention independently, and their outputs are combined at the end. This is called multi-head attention [31]. Through the above mechanism, the Transformer can focus on specific values of $Y$ and incorporate them into the output sequence.
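For concreteness, a single-head version of Eqs. (3) and (4) can be sketched in PyTorch as follows; the class name and the optional `mask` argument are ours, and dropout and the multi-head split are omitted.

```python
import math
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    """Single-head scaled dot product attention following Eqs. (3) and (4).

    Pass the same tensor as `x` and `y` for self attention; for cross
    attention, `x` is the target sequence and `y` the source sequence.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Linear(channels, channels)   # W_Q
        self.w_k = nn.Linear(channels, channels)   # W_K
        self.w_v = nn.Linear(channels, channels)   # W_V
        self.w_o = nn.Linear(channels, channels)   # W_O

    def forward(self, x, y, mask=None):
        q, k, v = self.w_q(x), self.w_k(y), self.w_v(y)             # Eq. (3)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # (L_x, L_y)
        if mask is not None:
            scores = scores + mask   # e.g. the causal/local mask of Section 3.3
        return self.w_o(scores.softmax(dim=-1) @ v)                 # Eq. (4)
```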
3.3 Local Attention
Although Transformer has a superior ability to recognize long sequences compared to recurrent neural networks, it is still known to perform poorly on extremely long sequences. Local attention (LA) [1] is a technique designed to handle such long sequences in Transformer.
Let $M$ be a mask that determines whether the $i$th element of $X$ may focus on the $j$th element. The output of the local attention layer is defined by Eq. (5), involving causal masking to prevent leakage from subsequent elements.

$Z = W_O\,\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d} + M\right)V$ (5)

The mask matrix is given by Eq. (6).

$M_{ij} = \begin{cases} 0 & \text{if } i - w < j \le i \\ -\infty & \text{otherwise} \end{cases}$ (6)

where $i$ and $j$ are the sequence indices, and $w$ is the width of the sliding window.
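The mask of Eq. (6) can be built as follows and passed to an attention layer such as the sketch in Section 3.2 (the helper name is ours).

```python
import torch

def local_causal_mask(length: int, window: int) -> torch.Tensor:
    """Sliding-window causal mask M of Eq. (6): position i may attend to
    positions j with i - window < j <= i; all other entries are -inf."""
    i = torch.arange(length).unsqueeze(1)   # (L, 1)
    j = torch.arange(length).unsqueeze(0)   # (1, L)
    allowed = (j <= i) & (j > i - window)
    mask = torch.zeros(length, length)
    mask[~allowed] = float("-inf")
    return mask

# With window = 3, token 5 may attend to tokens 3, 4, and 5 only.
print(local_causal_mask(6, 3))
```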
3.4 Positional Encoding
The Transformer [31] itself has a poor ability to recognize the position of each element in a sequence, so position information must be provided explicitly. Instead of inputting simple position values, two approaches have been proposed, namely positional embedding [4] and positional encoding [31]. In general, the latter works better on small training datasets.
The output of positional encoding for the index $t$ is defined by Eq. (7).

$p_{t,2i} = \sin\!\left(\dfrac{t}{10000^{2i/C}}\right), \quad p_{t,2i+1} = \cos\!\left(\dfrac{t}{10000^{2i/C}}\right)$ (7)

$p_t$ must be added directly to the feature vector with $C$ channels.

If the sequence has two-dimensional positions $(x, y)$, the 2D positional encoding proposed by Zhao et al. [40] may be a better choice. It normalizes the horizontal and vertical coordinates to $[0, 1]$, encodes each with Eq. (7), and then combines them into a single vector. The positional encoding of the pixel at $(x, y)$ is given by Eq. (8).

$p_{(x,y)} = \left[\,p_{x/W}\,;\,p_{y/H}\,\right]$ (8)

where $W$ and $H$ are the width and height for positional normalization, and $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation. In this paper, we omitted this normalization.
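The following sketch illustrates one common way to realize Eqs. (7) and (8): each coordinate is encoded with half of the channels and the two encodings are concatenated. Consistent with the note above, raw pixel indices are used without normalization; placing the sine and cosine parts in separate channel halves is an implementation choice.

```python
import torch

def sinusoidal_encoding(positions: torch.Tensor, channels: int) -> torch.Tensor:
    """Eq. (7): sin/cos encoding of a vector of 1D positions, shape (N, channels)."""
    i = torch.arange(channels // 2, dtype=torch.float32)
    freq = 1.0 / (10000.0 ** (2.0 * i / channels))             # (channels/2,)
    angles = positions.unsqueeze(-1).float() * freq            # (N, channels/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)     # (N, channels)

def positional_encoding_2d(height: int, width: int, channels: int) -> torch.Tensor:
    """Eq. (8): encode x and y with half the channels each and concatenate."""
    ys = torch.arange(height).repeat_interleave(width)   # row index of each pixel
    xs = torch.arange(width).repeat(height)              # column index of each pixel
    pe_x = sinusoidal_encoding(xs, channels // 2)
    pe_y = sinusoidal_encoding(ys, channels // 2)
    return torch.cat([pe_x, pe_y], dim=-1)               # (height * width, channels)

# Encodings for a 65x65 feature map with 512 channels, added to the flattened features.
pe = positional_encoding_2d(65, 65, 512)                 # shape (4225, 512)
```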
3.5 Mutual Learning
Ensemble learning is commonly used to improve machine learning generalization performance and fitting accuracy by averaging or complementing the outputs of multiple inference models. However, it is computationally more expensive than single models due to the large number of parameters especially for deep learning approaches.
To achieve similar effects using only a single model, knowledge distillation [7] may be an alternative solution. This is a technique that uses a large, complex neural network, e.g., an ensemble model, as a teacher and a small, simple model as a student to obtain higher performance than simply training the student model on the ground-truth data.
Mutual learning [38] may be another solution. Here, multiple student models are trained simultaneously and teach each other, without training a teacher model in advance. In particular, each student model performs supervised learning on the ground-truth data while minimizing the Kullback–Leibler (KL) divergence [14] so that the students' output distributions match each other.
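For two student classifiers, the per-student losses of deep mutual learning can be sketched as follows; weighting of the KL term and temperature scaling are omitted for brevity.

```python
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, targets):
    """Deep mutual learning [38]: each student is supervised by the ground truth
    and by the other student's predicted distribution via KL divergence."""
    ce_a = F.cross_entropy(logits_a, targets)
    ce_b = F.cross_entropy(logits_b, targets)
    # KL(p_b || p_a) pulls student A toward B's distribution, and vice versa.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1),
                    F.softmax(logits_b, dim=-1).detach(), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=-1),
                    F.softmax(logits_a, dim=-1).detach(), reduction="batchmean")
    return ce_a + kl_a, ce_b + kl_b
```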
3.6 Metrics
Zhong et al. [42] introduced a tree edit distance based similarity (TEDS) metric for performance evaluation of both table structure and cell content recognition. After converting the recognition results and the ground truth into tree structures of HTML tags, the TEDS score is calculated according to Eq. (9).

$\mathrm{TEDS}(T_a, T_b) = 1 - \dfrac{d(T_a, T_b)}{\max(|T_a|, |T_b|)}$ (9)

where $T_a$ and $T_b$ are the HTML trees, $d$ is the tree edit distance function, and $|T|$ is the number of nodes in $T$.
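As a small worked example, two trees with 25 and 30 nodes and a tree edit distance of 3 give TEDS = 1 − 3/30 = 0.9:

```python
def teds(edit_distance: float, nodes_a: int, nodes_b: int) -> float:
    """Eq. (9): tree-edit-distance-based similarity between two HTML trees."""
    return 1.0 - edit_distance / max(nodes_a, nodes_b)

print(teds(3, 25, 30))  # 0.9
```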
There are two versions of TEDS, namely structural TEDS and total TEDS. The former is calculated for HTML trees excluding cell contents and represents the recognition performance for table structures only. The latter is computed on complete HTML trees including cell contents and indicates the total recognition performance.
In addition, Zhong et al. [42] classified the tables into two subsets, namely simple tables and complex tables. The former are tables without vertically or horizontally merged cells, and the latter are all other tables.
4 Proposal
The proposal consists of a ResNet encoder and two local-attention Transformer decoders. The two decoders infer table structure and cell contents, respectively. An additional output layer estimates the cell bounding boxes.
The two main differences from previous studies [18, 19] are 1) the introduction of a multi-cell decoder and 2) the introduction of bidirectional mutual learning to the HTML decoder. In addition, 2D positional encoding is employed. We named the proposed method MuTabNet after mutual learning, multi-task learning, and the multi-cell decoder. Fig. 1 shows the network architecture.
4.1 Encoder
The encoder consists of a CNN backbone and 2D positional encoding. The CNN extracts a 65×65 feature map from a 520×520 pixel image. For the CNN, we adopted TableResNetExtra [36] with 26 convolutional layers and three GCA blocks. After 2D positional encoding, the image features are flattened into one-dimensional features with 512 channels for cross-attention at the decoders.
4.2 HTML Decoder
The HTML decoder consists of one embedding layer, three local attention blocks, and two output layers. Each attention block accepts a table structure sequence through a self-attention layer in the block. The attention block then incorporates image features into the table structure sequence through a cross-attention layer, and outputs the sequence through a feed-forward layer. Several skip connections and layer normalizations are inserted within the block. The output from the last attention block is converted into HTML tokens and cell bounding boxes by the two output layers.
During training, the decoder predicts left- or right-shifted HTML tokens from the input HTML tokens. The shift direction is specified by an additional one-hot vector. During inference, the decoder predicts the next token and iteratively extends the input sequence to obtain the complete HTML sequence.

In addition to HTML tokens, the decoder accepts some special tokens, namely SOS, EOS, and PAD. SOS triggers sequential inference and is inserted at the beginning of the sequence. EOS stops inference and is inserted at the end of the sequence. PAD is inserted after EOS to equalize the lengths of the sequences in a mini-batch.
Following previous studies [18, 19], the HTML sequence was simply tokenized into HTML tags, except for the <td> tag representing the start of a cell. A <td> tag is tokenized as ‘<td’, ‘colspan="2"’, ‘rowspan="3"’, ‘>’ if it contains colspan or rowspan attributes. Otherwise, the tag is simply tokenized as ‘<td>’. It should be noted that FinTabNet [41] and PubTabNet [42] described in Section 5.1 are publicly available with such tokenization applied. We then merged the <td> and immediately following </td> tokens into one token.
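The released labels already come in this tokenized form, but the splitting and merging rules can be illustrated with the following sketch (the helper function is ours).

```python
import re

def tokenize_structure(html: str) -> list[str]:
    """Tokenize a table-structure HTML string as described in Section 4.2
    (illustrative sketch; PubTabNet/FinTabNet ship already-tokenized labels)."""
    tokens = []
    for tag in re.findall(r"<[^>]+>", html):
        match = re.match(r"<td\s+(.+)>", tag)
        if match:                                   # <td> with colspan/rowspan attributes
            tokens.append("<td")
            tokens.extend(match.group(1).split())   # e.g. 'colspan="2"', 'rowspan="3"'
            tokens.append(">")
        else:
            tokens.append(tag)                      # plain tags, including '<td>'
    # Merge each '<td>' with the immediately following '</td>' into one token.
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] == "<td>" and i + 1 < len(tokens) and tokens[i + 1] == "</td>":
            merged.append("<td></td>")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(tokenize_structure('<tr><td></td><td colspan="2" rowspan="3"></td></tr>'))
# ['<tr>', '<td></td>', '<td', 'colspan="2"', 'rowspan="3"', '>', '</td>', '</tr>']
```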
4.3 Cell Decoder
The cell decoder consists of one embedding layer, one local attention block, and an output layer. Following previous studies [18, 19], the embedding layer accepts cell characters one by one. This is because cell contents typically consist of short sentences or unknown words or numbers, making it difficult to utilize pretrained language models.
The basic structure of a cell decoder is similar to that of an HTML decoder, with the following differences. First, a special token SEP is inserted between cell contents to trigger movement to the next cells. Second, the cell decoder accepts a combination of cell contents and their corresponding HTML features extracted from the output of the HTML decoder. These improvements allow the proposal to sequentially read the contents of multiple cells while referring to information from previous cells.
The previous study [18] exploited local attention for the HTML decoder and global attention for the cell decoder. This was because the cell decoder processed each cell independently, and the cell contents were short in general. On the other hand, in our proposed multi-cell decoder, the sequence of cell contents tends to be long. Consequently, we employed local attention.
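As an illustration of the input described above, the character-level sequence for the multi-cell decoder can be built as follows; the special-token spellings are ours.

```python
def build_cell_target(cells: list[str]) -> list[str]:
    """Character-level target for the multi-cell decoder (illustrative sketch):
    cell contents are read one character at a time, with a SEP token between
    cells that triggers movement to the next cell."""
    tokens = ["<SOS>"]
    for index, text in enumerate(cells):
        if index > 0:
            tokens.append("<SEP>")
        tokens.extend(list(text))   # characters, not subwords
    tokens.append("<EOS>")
    return tokens

print(build_cell_target(["Year", "2024", "3.14"]))
# ['<SOS>', 'Y', 'e', 'a', 'r', '<SEP>', '2', '0', '2', '4', '<SEP>', '3', '.', '1', '4', '<EOS>']
```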
4.4 Bidirectional Mutual Learning
We propose bidirectional mutual learning inspired by deep mutual learning [38] to train the HTML decoder. Here, two equivalent decoders are trained together to predict table structure in either a left-to-right (LtoR) or right-to-left (RtoL) direction. To reduce model parameters, we implemented the mutual learning in a single decoder by combining an additional one-hot vector that determines the direction with the embedded HTML tokens.
Let $X_f$ and $X_b$ be the LtoR and RtoL sequences, respectively, and let $y$ and $p$ be the ground-truth and the predicted probabilities. The loss for the LtoR decoder is defined by Eq. (10).

$\mathcal{L}_f = \mathcal{L}_{\mathrm{CE}}(y_f, p_f) + D_{\mathrm{KL}}(p_b \,\|\, p_f)$ (10)

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss with the ground truth and $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence between the RtoL predictions, reversed to align with the LtoR positions, and the LtoR predictions. The loss for the RtoL decoder is defined symmetrically.
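The sketch below illustrates one bidirectional training step for a single, unpadded structure sequence. The decoder interface (a `direction` argument selecting the one-hot vector) and the alignment of the two prediction streams are our assumptions for illustration; only the overall pattern of Eq. (10), cross entropy to the ground truth plus a KL term toward the other direction, reflects the method itself.

```python
import torch
import torch.nn.functional as F

def bidirectional_mutual_loss(decoder, memory, tokens):
    """Hedged sketch of one bidirectional mutual-learning step (Section 4.4)
    for a single, unpadded structure sequence `tokens` of length L.

    `decoder(inputs, memory, direction)` is assumed to return (L-1, vocab)
    logits; boundary-token handling for the reversed sequence is simplified.
    """
    reversed_tokens = tokens.flip(dims=[0])                       # RtoL sequence

    # Teacher forcing: predict the next token from all preceding tokens.
    logits_fwd = decoder(tokens[:-1], memory, direction="LtoR")
    logits_bwd = decoder(reversed_tokens[:-1], memory, direction="RtoL")

    ce_fwd = F.cross_entropy(logits_fwd, tokens[1:])
    ce_bwd = F.cross_entropy(logits_bwd, reversed_tokens[1:])

    # logits_fwd[p] predicts tokens[p + 1]; after flipping, logits_bwd predicts
    # tokens[p], so shift by one position to compare the same targets.
    aligned_bwd = logits_bwd.flip(dims=[0])
    kl_fwd = F.kl_div(F.log_softmax(logits_fwd[:-1], dim=-1),
                      F.softmax(aligned_bwd[1:], dim=-1).detach(),
                      reduction="batchmean")
    kl_bwd = F.kl_div(F.log_softmax(aligned_bwd[1:], dim=-1),
                      F.softmax(logits_fwd[:-1], dim=-1).detach(),
                      reduction="batchmean")

    # Eq. (10) for the LtoR direction, and its RtoL counterpart.
    return ce_fwd + kl_fwd, ce_bwd + kl_bwd
```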
5 Experiments
To evaluate the effectiveness of the multi-cell decoder and bidirectional mutual learning, we conducted experiments on two public table datasets.
5.1 Datasets
We utilized two large datasets, FinTabNet [41] and PubTabNet [42]. In addition, we used a subset named PubTabNet250 [18] for ablation studies. Table 1 shows the statistics for the datasets.
Table 1: Statistics (number of tables) of the datasets.

| Dataset | Training | Validation | Evaluation |
|---|---|---|---|
| FinTabNet | 91,596 | 10,635 | 10,656 |
| PubTabNet | 500,777 | 9,115 | 9,064 |
| PubTabNet250 | 114,111 | 2,161 | - |
5.1.1 FinTabNet
is a large dataset of table images, including HTML labels and cell bounding boxes, extracted from the annual reports of S&P 500 companies. The dataset contains 112k tables and is divided into training, validation, and evaluation sets. It should be noted that the original FinTabNet release confuses its validation and evaluation sets. Following previous studies [21, 41], we treated the validation set containing 10,656 images as the evaluation set.
5.1.2 PubTabNet
is a dataset built by collecting scientific articles from the PubMed central open access subset, containing 568k tables and corresponding structure and cell content annotations and cell bounding boxes. PubTabNet provides the training and validation sets, and the evaluation set was provided for the ICDAR competition [37]. We classified the tables into simple tables and complex tables as described in Section 3.6.
5.1.3 PubTabNet250
Ly and Takasu [18] extracted tables with 250 or more HTML tokens from PubTabNet and created a subset named PubTabNet250. They also introduced subsets of tables containing at least 500, 600, and 700 tokens. These subsets were originally used [18] to demonstrate the effectiveness of the local attention mechanism. We also used these subsets for the ablation studies in Section 5.4, reducing the training time from approximately 179 hours to 45 hours per model.
5.2 Implementation
The proposed model was implemented in PyTorch using mmcv [22], mmdet [23], and mmocr [24] frameworks and trained on four NVIDIA V100 GPUs with batch size 8 in total. We used Ranger [33] optimizer. The learning rate was initialized to 0.001 for the first 25 epochs, and decreased to 0.0001 and 0.00001 for the next three and last two epochs, respectively.
Each table image was normalized and resized to fit within 520×520 pixels, padding the margins with zeros if necessary. The cell bounding boxes were normalized to have a minimum value of 0 and a maximum value of 1.
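For illustration, this preprocessing can be sketched as follows; the function name and tensor layout are ours.

```python
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, boxes: torch.Tensor, size: int = 520):
    """Resize the longer side to `size`, zero-pad the margins to a square canvas,
    and normalize the bounding boxes to [0, 1] (illustrative sketch).

    image: (C, H, W) float tensor; boxes: (N, 4) as (x1, y1, x2, y2) in pixels.
    """
    _, h, w = image.shape
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                            mode="bilinear", align_corners=False).squeeze(0)
    padded = torch.zeros(image.size(0), size, size)
    padded[:, :new_h, :new_w] = resized        # zero padding on the margins
    norm_boxes = boxes * scale / size          # coordinates now in [0, 1]
    return padded, norm_boxes
```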
HTML tokens and cell contents were converted to 512-dimensional embedded representations. The four attention blocks in the HTML and cell decoders have the same 8-head, 512-channel architecture, and the sliding window size for local attention was set to 300 by default, following previous work [18]. The maximum lengths for table structure sequences and cell content sequences were set to 800 and 8000, respectively, including special tokens. We employed greedy search for sequential prediction.
To ensure a fair comparison with the previous studies, we did not utilize data augmentation or ensemble learning techniques. We also did not take advantage of early stopping.
5.3 Experimental Results
We compared the performance of the proposed model trained on FinTabNet and PubTabNet with the claimed performance of existing models.
5.3.1 FinTabNet
Table 2: TEDS scores (%) on the FinTabNet evaluation set.

| Model | Structure | Total |
|---|---|---|
| EDD [42] | 90.60 | - |
| GTE [41] | 87.14 | - |
| GTE (PT) [41] | 91.02 | - |
| TableFormer [21] | 96.80 | - |
| VAST [10] | 98.63 | 98.21 |
| Ly et al. [20] | 98.72 | 95.32 |
| Ly and Takasu [19] | 98.79 | - |
| Ly and Takasu [18] | 98.85 | 95.74 |
| MuTabNet | 98.87 | 97.69 |
We evaluated structure recognition and total recognition using the TEDS metric. Table 2 compares the TEDS scores on the evaluation set between the proposal and previous models. The proposal achieves the best structural TEDS score of 98.87% and a total TEDS score of 97.69%. The inference time using the 4 GPUs was 3.78 hours.
The total TEDS score of the proposal was lower than the score of VAST [10], which could be explained by the fact that VAST exploits external OCR for cell content recognition. In contrast, the structural TEDS score of VAST was lower than those of end-to-end approaches [20, 19, 18], including the proposal.
5.3.2 PubTabNet
Table 3: TEDS scores (%) on the PubTabNet validation set.

| Model | Simple | Complex | Total |
|---|---|---|---|
| EDD [42] | 91.20 | 85.40 | 88.30 |
| TabStruct-Net [28] | - | - | 90.10 |
| TableFormer [21] | 95.40 | 90.10 | 93.60 |
| SEM [39] | 94.80 | 92.50 | 93.70 |
| LGPMA&OCR [27] | - | - | 94.60 |
| VCGroup [36] | - | - | 96.26 |
| VCGroup&ME [36] | - | - | 96.84 |
| VAST [10] | - | - | 96.31 |
| Ly et al. [20] | 97.89 | 95.02 | 96.48 |
| Ly and Takasu [19] | 97.92 | 95.36 | 96.67 |
| Ly and Takasu [18] | 98.07 | 95.42 | 96.77 |
| MuTabNet | 98.16 | 95.53 | 96.87 |
Table 4: TEDS scores (%) on the final evaluation set of the ICDAR 2021 competition [37].

| Model | Simple | Complex | Total |
|---|---|---|---|
| LTIAYN [37] | 97.18 | 92.40 | 94.84 |
| anyone [37] | 96.95 | 93.43 | 95.23 |
| PaodingAI [37] | 97.35 | 93.79 | 95.61 |
| TAL [37] | 97.30 | 93.93 | 95.65 |
| DBJ [37] | 97.39 | 93.87 | 95.66 |
| YG [37] | 97.38 | 94.79 | 96.11 |
| XM [39] | 97.60 | 94.89 | 96.27 |
| VCGroup [36] | 97.90 | 94.68 | 96.32 |
| Davar-Lab-OCR [37] | 97.88 | 94.78 | 96.36 |
| Ly et al. [20] | 97.51 | 94.37 | 95.97 |
| Ly and Takasu [19] | 97.60 | 94.68 | 96.17 |
| Ly and Takasu [18] | 97.77 | 94.58 | 96.21 |
| MuTabNet | 98.01 | 94.98 | 96.53 |
We evaluated the experimental results of table recognition on the validation set using the TEDS metric. Table 3 compares the scores between the proposal and previous methods. The proposal outperforms all previous methods with scores of 98.16%, 95.53% and 96.87% on simple tables, complex tables, and all tables, respectively. The inference time using the 4 GPUs was 3.23 hours.
We also evaluated our proposal on the evaluation set. Table 4 compares the scores of the proposal with the top 10 solutions of the ICDAR competition [37]. The high scores achieved on both sets indicate high generalization performance of the proposal. The inference time was 3.13 hours.
The score of the proposal was higher than the score of VAST [10]. PubTabNet contains a large amount of training data, and the proposed model appears to be well trained for cell content recognition tasks.
It should be noted that VCGroup&ME [36] utilized additional annotations of text-line bounding boxes within cell contents and an ensemble of three models. The proposed model outperforms all other non-end-to-end models that relied on such additional annotation and ensemble learning, even though it uses neither technique.
5.4 Ablation Studies
We conducted additional experiments for ablation studies using PubTabNet250 dataset for training and PubTabNet subsets for evaluation.
5.4.1 Effectiveness of Multi-Cell Decoder and Mutual Learning
We evaluated the effectiveness of the proposed methods, namely the multi-cell (MC) decoder and bidirectional mutual learning (BML). We trained two models on the training set and report the validation scores in Table 5. As baselines, we selected previous experimental results [18] obtained with exactly the same model architecture and dataset except for MC and BML. LA in the table refers to local attention.
Since the previous study [18] focused on performance for long tables, we also calculated TEDS scores for tables containing at least 500, 600, and 700 structure tokens. The MC decoder outperforms the baselines at all table lengths, and BML further improves table recognition performance.
The effect of BML was unclear in the structural TEDS scores but evident in the total TEDS scores. BML may nevertheless improve implicit structure recognition and thereby benefit cell content recognition, which requires precise content locations.
5.4.2 Window Size of Cell Decoder
Ly and Takasu [18] reported that a window size of 300 was optimal for the HTML decoder, whereas the cell decoder used global attention. In this study, we determined the optimal window size for the MC decoder. Table 6 shows the change in TEDS scores on the validation set as the window size varies from 100 to 500, while the window size for the HTML decoder was fixed at 300.
In general, a window size of 300 achieved the highest score, with the exception of tables containing more than 500 tokens, where a window size of 100 achieved the highest score. Tables with many cells tend to have fewer characters per cell, and a shorter window may be sufficient.
It should be noted that we used the PubTabNet250 dataset for training, and the performance for tables with fewer structure tokens was lower than the scores in Table 3. We selected the window size of 300 as the optimal value for the entire PubTabNet dataset containing tables with fewer tokens from the perspective of generalization performance.
Table 5: TEDS scores (%) of models trained on PubTabNet250 and evaluated on PubTabNet subsets containing at least 250, 500, 600, and 700 structure tokens. LA: local attention, MC: multi-cell decoder, BML: bidirectional mutual learning.

| LA | MC | BML | Struct. 250+ | Struct. 500+ | Struct. 600+ | Struct. 700+ | Total 250+ | Total 500+ | Total 600+ | Total 700+ |
|---|---|---|---|---|---|---|---|---|---|---|
| - | - | - | - | - | - | - | 93.86 | 91.16 | 90.63 | 88.65 |
| ✓ | - | - | - | - | - | - | 94.28 | 92.99 | 91.29 | 89.61 |
| ✓ | ✓ | - | 96.60 | 96.71 | 96.75 | 96.67 | 95.02 | 94.59 | 93.73 | 93.14 |
| ✓ | ✓ | ✓ | 97.02 | 96.70 | 96.35 | 96.65 | 95.81 | 95.11 | 94.05 | 94.02 |
Table 6: TEDS scores (%) for different sliding window sizes of the cell decoder.

| Size | Struct. 250+ | Struct. 500+ | Struct. 600+ | Struct. 700+ | Total 0+ | Total 250+ | Total 500+ | Total 600+ | Total 700+ |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 96.96 | 96.69 | 96.98 | 96.60 | 75.91 | 95.70 | 95.19 | 94.99 | 94.35 |
| 200 | 96.79 | 96.53 | 96.30 | 95.83 | 75.79 | 95.46 | 94.66 | 93.80 | 92.69 |
| 300 | 97.02 | 96.70 | 96.35 | 96.65 | 83.15 | 95.81 | 95.11 | 94.05 | 94.02 |
| 400 | 96.83 | 96.85 | 96.48 | 96.51 | 82.58 | 95.40 | 95.08 | 94.00 | 93.50 |
| 500 | 96.97 | 96.74 | 97.03 | 96.54 | 81.14 | 95.51 | 94.46 | 93.88 | 92.65 |
6 Conclusion
We improved an end-to-end table recognition model based upon Transformer to achieve performance comparable to state-of-the-art models using external OCR systems. The proposed model consists of a ResNet encoder and two decoders for structure recognition and cell content recognition. After the first decoder infers the structure tokens, the second decoder reads the text within each cell.
We proposed a multi-cell decoder for cell content recognition to exploit useful information from neighbor cells. Furthermore, we proposed bidirectional mutual learning to force the model to pay attention to both previous and following cells. Experimental results using two public datasets demonstrate the effectiveness of the proposed methods.
In future work, we will investigate multi-task models that also recognize the meaning of tables, which would enable a deeper understanding of printed documents, including table contents, and provide high-quality scientific knowledge for LLMs and question-answering systems.
References
- [1] Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer (2020)
- [2] Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 1971–1980 (2019)
- [3] Deng, Y., Rosenberg, D., Mann, G.: Challenges in end-to-end neural scientific table recognition. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 894–901 (2019)
- [4] Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: International Conference on Machine Learning (ICML). pp. 1243–1252 (2017)
- [5] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV). pp. 2980–2988 (2017)
- [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)
- [7] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015)
- [8] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
- [9] Huang, G., Liu, Z., Maaten, L.V.D., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2261–2269 (2017)
- [10] Huang, Y., Lu, N., Chen, D., Li, Y., Xie, Z., Zhu, S., Gao, L., Peng, W.: Improving table structure recognition with visual-alignment sequential coordinate modeling. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11134–11143 (2023)
- [11] Itonori, K.: Table structure recognition based on textblock arrangement and ruled line position. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 765–768 (1993)
- [12] Kayal, P., Anand, M., Desai, H., Singh, M.: ICDAR 2021 competition on scientific table image recognition to LaTeX. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 754–766 (2021)
- [13] Kieninger, T.G.: Table structure recognition based on robust block segmentation. In: Photonics West ’98. vol. 3305, pp. 22–32 (1998)
- [14] Kullback, S., Leibler, R.A.: On information and sufficiency. The Annals of Mathematical Statistics 22(1) (1951)
- [15] Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: TableBank: Table benchmark for image-based table detection and recognition. In: Language Resources and Evaluation Conference (LREC). pp. 1918–1925 (2020)
- [16] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3431–3440 (2015)
- [17] Lu, N., Yu, W., Qi, X., Chen, Y., Gong, P., Xiao, R., Bai, X.: MASTER: Multi-aspect non-local network for scene text recognition. Pattern Recognition 117, 107980 (2021)
- [18] Ly, N.T., Takasu, A.: An end-to-end local attention based model for table recognition. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 20–36 (2023)
- [19] Ly, N.T., Takasu, A.: An end-to-end multi-task learning model for image-based table recognition. In: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP). pp. 626–634 (2023)
- [20] Ly, N.T., Takasu, A., Nguyen, P., Takeda, H.: Rethinking image-based table recognition using weakly supervised methods. In: International Conference on Pattern Recognition Applications and Methods (ICPRAM). pp. 872–880 (2023)
- [21] Nassar, A., Livathinos, N., Lysak, M., Staar, P.: TableFormer: Table structure understanding with transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4604–4613 (2022)
- [22] OpenMMLab: MMCV, https://github.com/open-mmlab/mmcv
- [23] OpenMMLab: MMDetection, https://github.com/open-mmlab/mmdetection
- [24] OpenMMLab: MMOCR, https://github.com/open-mmlab/mmocr
- [25] Peng, A., Lee, S., Wang, X., Balasubramaniyan, R., Chau, D.H.: High-performance transformers for table structure recognition need early convolutions. In: Table Representation Learning Workshop (2023)
- [26] Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 2439–2447 (2020)
- [27] Qiao, L., Li, Z., Cheng, Z., Zhang, P., Pu, S., Niu, Y., Ren, W., Tan, W., Wu, F.: LGPMA: Complicated table structure recognition with local and global pyramid mask alignment. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 99–114 (2021)
- [28] Raja, S., Mondal, A., Jawahar, C.V.: Table structure recognition using top-down and bottom-up cues. In: Computer Vision – ECCV. pp. 70–86 (2020)
- [29] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. vol. 28 (2015)
- [30] Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 1162–1167 (2017)
- [31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. vol. 30, pp. 6000–6010 (2017)
- [32] Wang, Y., Phillips, I.T., Haralick, R.M.: Table structure understanding and its performance evaluation. Pattern Recognition 37(7), 1479–1497 (2004)
- [33] Wright, L.: Ranger - a synergistic optimizer (2019), https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
- [34] Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5987–5995 (2017)
- [35] Yang, F., Hu, L., Liu, X., Huang, S., Gu, Z.: A large-scale dataset for end-to-end table recognition in the wild. Scientific Data 10(1), 110 (2023)
- [36] Ye, J., Qi, X., He, Y., Chen, Y., Gu, D., Gao, P., Xiao, R.: PingAn-VCGroup’s solution for ICDAR 2021 competition on scientific literature parsing task B: Table recognition to HTML (2021)
- [37] Yepes, A.J., Zhong, P., Burdick, D.: ICDAR 2021 competition on scientific literature parsing. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 605–617 (2021)
- [38] Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4320–4328 (2018)
- [39] Zhang, Z., Zhang, J., Du, J., Wang, F.: Split, embed and merge: An accurate table structure recognizer. Pattern Recognition 126, 108565 (2022)
- [40] Zhao, W., Gao, L., Yan, Z., Peng, S., Du, L., Zhang, Z.: Handwritten mathematical expression recognition with bidirectionally trained transformer. In: International Conference on Document Analysis and Recognition (ICDAR). pp. 570–584 (2021)
- [41] Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In: IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 697–706 (2021)
- [42] Zhong, X., ShafieiBavani, E., Yepes, A.J.: Image-based table recognition: Data, model, and evaluation. In: Computer Vision – ECCV. pp. 564–580 (2020)