
End-to-End Information Extraction by
Character-Level Embedding and Multi-Stage Attentional U-Net

Tuan Anh Nguyen ([email protected])
Dat Nguyen ([email protected])
Cinnamon AI Labs, Hanoi, Vietnam

Abstract

Information extraction from document images has received a lot of attention recently, due to the need for digitizing a large volume of unstructured documents such as invoices, receipts, bank transfers, etc. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the Multi-Stage Attentional U-Net. To effectively capture the textual and spatial relations between 2D elements, our model leverages a specialized multi-stage encoder-decoder design, in conjunction with efficient use of the self-attention mechanism and the box convolution. Experimental results on different datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters. Moreover, it also significantly improves on the baseline under erroneous OCR and limited training data, making it practical for real-world applications.

1 Introduction

Information Extraction (IE) is the process of extracting structured information from unstructured documents. Traditional document processing systems prior to the deep learning era often rely on template matching methods to extract information from fixed-format documents [Hanchuan Peng et al.(2000)Hanchuan Peng, Fuhui Long, Wan-Chi Siu, Zheru Chi, and David Dagan Feng, Hanchuan Peng et al.(2003)Hanchuan Peng, Fuhui Long, and Zheru Chi]. However, these methods scale poorly to a large number of templates and are also sensitive to slight input distortions. A more general approach to handling varied-format documents is a "document understanding" system, which utilizes a layout analysis step to detect the structure of text lines, tables and paragraphs in the input image [Mao et al.(2003)Mao, Rosenfeld, and Kanungo, Ghai and Jain(2013)]. After that, an information extraction step [Simoes, Goncalo; Galhardas, Helena; Coheur(2005), Rusinol et al.(2013)Rusinol, Benkhelfallah, and Dandecy] is required to determine their functional roles and semantics. Designing such a method is difficult, especially when the input documents have a high degree of variance and complex layouts [Yang et al.(2017)Yang, Yumer, Asente, Kraley, Kifer, and Giles]. Moreover, errors from preceding processes such as text recognition (OCR) and layout analysis further hinder the performance of IE methods. Hence, there is still much room for improvement before automatic document extraction becomes feasible.

To address the aforementioned problems, we present an end-to-end deep neural network architecture, the Multi-Stage Attentional U-Net, which can be trained directly with supervision from annotated document data. Using the 2D character-grid (or char-grid for short) representation of the document as described in [Katti et al.(2018)Katti, Reisswig, Guder, Brarda, Bickel, Höhne, and Faddoul], our network performs a pixel-level semantic segmentation task to label and extract the relevant information. The char-grid embedding differs substantially from natural images, featuring both strong local texture cues and global spatial relationships, and is therefore challenging for conventional semantic segmentation architectures. In order to correctly exploit both the textual and spatial features of the char-grid, our network leverages a specialized multi-stage encoder-decoder design, where the earlier encoder-decoder blocks focus on the textual components to identify the important elements (e.g. headers/keywords), and the later ones learn the spatial relations between elements and propagate that information from the intermediate context to the target values. Long-range dependencies and correlations between spatial positions on the document are captured by an additional self-attention mechanism [Zhang et al.(2018)Zhang, Goodfellow, Metaxas, and Odena] and the box convolution layer [Burkov(2018)], which further strengthen the ability of the proposed network to model diverse document layouts. All modules are differentiable and can be jointly trained in an end-to-end fashion. Notably, the training process uses a multi-task training scheme with an auxiliary loss and a data augmentation process to enhance the generalization of our model. We validate our method on various real-world information extraction datasets and achieve significant improvements over the baseline U-Net model in all tasks while using fewer parameters. The network also performs well with limited training data and erroneous OCR output, which matches the requirements of industrial automatic document extraction pipelines. In summary, our contributions are several-fold:

  • We propose a specialized deep neural network architecture to perform information extraction on the 2D character-grid, which largely surpasses the strong U-Net baseline while using 40% fewer parameters.

  • We systematically evaluate the use of the self-attention mechanism, box convolution and multi-stage encoder-decoders for modeling complex spatial relations in document layouts. Extensive ablation studies are conducted to verify the effectiveness of each component.

  • Rigorous experiments are performed on our in-house customer-provided datasets to validate the robustness of our method under harsh conditions (limited training data, heavy OCR errors).

2 Related works

Information extraction techniques for document images can be roughly classified into three groups: template matching-based methods, heuristic-based methods and deep-learning-based methods. These approaches are reviewed as follows:

Template Matching Based Methods: This research direction formulates information extraction as document registration [Hanchuan Peng et al.(2000)Hanchuan Peng, Fuhui Long, Wan-Chi Siu, Zheru Chi, and David Dagan Feng, Hanchuan Peng et al.(2003)Hanchuan Peng, Fuhui Long, and Zheru Chi] or form classification [Aldavert et al.(2017)Aldavert, Rusinol, and Toledo], in which an image is first matched to a form type in a database and the information is then extracted from predefined areas. Further reading on template matching methods for document image information extraction can be found in [Hanchuan Peng et al.(2000)Hanchuan Peng, Fuhui Long, Wan-Chi Siu, Zheru Chi, and David Dagan Feng, Hanchuan Peng et al.(2003)Hanchuan Peng, Fuhui Long, and Zheru Chi]. However, these methods generally struggle with undefined and unseen form structures, which hinders their wide adoption in automatic document processing pipelines.

Heuristic Methods: Conventional information extraction pipelines [Bukhari et al.(2011)Bukhari, Shafait, and Breuel, Namboodiri and Jain(2007)] often build up their components using carefully designed feature engineering. Kise et al., in their seminal work [Kise et al.(1999)Kise, Iwata, and Matsumoto], introduced a selective method based on the edges of primary and residual Voronoi graphs that reduces the cost of character-wise segmentation and mainly relies on connected component labeling of binary inputs. As conventional computer vision-based segmentation methods evolved, their adaptations to document images were exploited: while [Chakrabarti et al.(2008)Chakrabarti, Kumar, and Punera], [Oro and Ruffolo(2009)] and [Co(2014)] used text lines to detect tables or segment pages into regions, [Wang et al.(2004)Wang, Phillips, and Haralick] defines table reading orders and heuristically performs table understanding on segmented text lines and separators. One major limitation of heuristic approaches is that they assume certain conditions on the dataset, which limits their generality across different types of forms.

Deep learning based methods: Instead of hand-crafted feature design, recent developments in deep learning for image segmentation, such as Fully Convolutional Networks [Long et al.(2015)Long, Shelhamer, and Darrell] and U-Net [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox], have allowed early attempts at training end-to-end text line-level segmentation and labeling, such as [Oliveira et al.(2018)Oliveira, Seguin, and Kaplan], which applied a fully convolutional network to directly extract regions of interest from historical documents. Our proposed method is more aligned with recent end-to-end systems that incorporate both semantic structure and spatial image features [Yang et al.(2017)Yang, Yumer, Asente, Kraley, Kifer, and Giles, Palm et al.(2018)Palm, Laws, and Winther, Katti et al.(2018)Katti, Reisswig, Guder, Brarda, Bickel, Höhne, and Faddoul]. While [Yang et al.(2017)Yang, Yumer, Asente, Kraley, Kifer, and Giles] uses sentence-level embedding, which assumes perfect text-line grouping, to semantically segment conventional documents, we aim for information extraction from visually rich documents, for which character-level segmentation is essential.

To the best of our knowledge, the proposed neural network-based model is essentially different from the two aforementioned approaches. Since we consider the problem of progressive information extraction (i.e. the ‘key’ is segmented before the whole ‘key-value’ pair is segmented), the potential of multiple stacked U-Nets has to be considered, for which we adapt the structure of the recent Coupled U-Net [Tang et al.(2018)Tang, Peng, Geng, Zhu, and Metaxas], while [Yang et al.(2017)Yang, Yumer, Asente, Kraley, Kifer, and Giles, Palm et al.(2018)Palm, Laws, and Winther, Katti et al.(2018)Katti, Reisswig, Guder, Brarda, Bickel, Höhne, and Faddoul] leverage variants of single-output fully convolutional networks. Furthermore, the model is equipped with global information propagation capability through the use of the self-attention mechanism in each convolutional block; the box convolution [Burkov(2018)] is also leveraged to incorporate information from distant regions.

Regarding the embedding, the char-grid [Palm et al.(2018)Palm, Laws, and Winther] is chosen and tweaked over the hierarchical levels of sub-word [Bojanowski et al.(2016)Bojanowski, Grave, Joulin, and Mikolov], word [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean, Pennington et al.(2014)Pennington, Socher, and Manning] or sentence-level embedding used in [Yang et al.(2017)Yang, Yumer, Asente, Kraley, Kifer, and Giles], because we observe that there are cases such as numbers, formatted dates, etc. that cannot be appropriately represented by a finite number of vectors nor modeled accurately by N-grams.

3 Method

3.1 Chargrid Conversion

After the document image is processed by a standalone OCR module to obtain text lines, the chargrid is constructed as follows:

  • Character separation: We first convert each text-line box into character boxes by dividing it horizontally by the number of characters in the text, under the assumption that within a text line the variance in character width and height is minimal. After this step, a list of character boxes $L_{c}=\{(c_{1},b_{1}),(c_{2},b_{2}),\ldots,(c_{N},b_{N})\}$ is obtained, where $c_{i}$ is the character index and $b_{i}$ its respective bounding box.

  • Character encoding: Each character is indexed, and a constant number $N_{char}$ of the most frequent characters is chosen. The character boxes are then used to create a mask $CM\in\{0,1,\dots,N_{char}\}^{H\times W}$: for each character box in the list with character index $i_{char}\in\{1,\dots,N_{char}\}$, the region of the mask covered by the box is filled with the value $i_{char}$. $CM$ is then converted to the one-hot format $CM_{one\_hot}\in\{0,1\}^{H\times W\times(N_{char}+1)}$.

At this point, the problem of information extraction is equivalent to pixel-wise segmentation on the chargrid: taking $CM_{one\_hot}$ as input, we wish to train a neural network that outputs a mask $M\in\{0,\dots,K-1\}^{H\times W}$ marking the regions of the $K$ target fields, with supervision from a labeled mask $LM\in\{0,\dots,K-1\}^{H\times W}$ for each input document.
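As a concrete illustration, the following is a minimal sketch of the chargrid conversion under the assumptions above (equal horizontal division of text-line boxes, one-hot encoding of the $N_{char}$ most frequent characters). The function names, the (x, y, w, h) box convention and the vocabulary handling are illustrative, not the authors' released code.

```python
import numpy as np

def textlines_to_charboxes(textlines):
    """Character separation: split each OCR text line into per-character boxes
    by equal horizontal division. `textlines` is a list of
    (text, (x, y, w, h)) tuples; names and box convention are illustrative."""
    char_boxes = []
    for text, (x, y, w, h) in textlines:
        n = max(len(text), 1)
        cw = w / n  # assumes near-uniform character width within a line
        for i, ch in enumerate(text):
            char_boxes.append((ch, (x + i * cw, y, cw, h)))
    return char_boxes

def charboxes_to_chargrid(char_boxes, vocab, H, W, n_char=256):
    """Character encoding: rasterize character boxes into a one-hot char-grid of
    shape (H, W, n_char + 1); index 0 is background / out-of-vocabulary."""
    cm = np.zeros((H, W), dtype=np.int32)
    for ch, (x, y, w, h) in char_boxes:
        idx = vocab.get(ch, 0)                      # 1..n_char for frequent characters
        x0, y0 = max(int(round(x)), 0), max(int(round(y)), 0)
        x1, y1 = min(int(round(x + w)), W), min(int(round(y + h)), H)
        cm[y0:y1, x0:x1] = idx
    return np.eye(n_char + 1, dtype=np.float32)[cm]  # (H, W, n_char + 1)
```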

3.2 Network architecture

Figure 1: Architecture of the Multi-Stage Attentional U-Net. The input chargrid goes through several encoder-decoder blocks to generate the output mask. Both encoder-decoder blocks are tightly connected by additional coupling connections. An auxiliary loss is used during training to increase the context-awareness of the model.

The architecture of our model is depicted in Figure 1. We use the Coupled U-Net [Tang et al.(2018)Tang, Peng, Geng, Zhu, and Metaxas] as the backbone; the inspiration behind this choice is the benefit of multi-stage outputs. For example, in order to understand what kind of information a value in a table represents, a person has to look at the top header or the side description (see Figure 2(a)). This behavior can be enforced by having the middle U-Net block segment keys and/or descriptions while the final block outputs the full segmentation of all key/value collections. In addition, other advantages of the Coupled U-Net are also leveraged, such as the avoidance of vanishing gradients while promoting feature reuse through the dense skip-connections between corresponding encoder and decoder blocks of consecutive U-Nets, and the auxiliary loss applied to the output of the middle U-Net block.

The details are as follows. Firstly, for flexibility, each encoder/decoder block is constructed by chaining multiple ResBlocks, each of which is a ResNet-like bottleneck with $3\times 3$ convolutions, batch normalization and a residual connection. The downsampling blocks are equipped with dilated convolutions, while the upsampling blocks use transposed convolutions to upscale their resolution. Secondly, from the intuitive observation that different information flows from table headers and descriptors are required to extract the right target value (illustrated in Figure 2(a)), we incorporate the box convolution layer of [Burkov(2018)] (Figure 2(b)) into a variant of the ResBlock, which helps model long-range vertical and horizontal interactions. Box convolutions [Burkov(2018)] are sliding mean windows with learnable offsets instead of learnable weights (see Figure 2(b)). Learning the offset patterns aids the model in aggregating information from distant regions in a straightforward way, making it a suitable alternative to conventional convolution blocks. Thirdly, to strengthen the Coupled U-Net design, expansion paths (also known as skip-connections) from each downsampling block to the upsampling block at the same level are added, with a self-attention mechanism (non-local network) [Zhang et al.(2018)Zhang, Goodfellow, Metaxas, and Odena] (see Figure 3) to promote global information propagation between each encoder-decoder pair of a U-Net block.
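For concreteness, here is a minimal sketch of such a ResBlock in the TensorFlow Keras API: a chain of $3\times 3$ (optionally dilated) convolutions with batch normalization, closed by a residual connection. The filter counts, activation placement and shortcut projection are assumptions for illustration, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_block(x, channels, dilation=1, depth=2):
    """ResNet-like bottleneck: `depth` 3x3 convolutions (res_depth in the paper)
    with batch normalization, followed by a residual connection."""
    shortcut = x
    if x.shape[-1] != channels:                      # project shortcut if widths differ
        shortcut = layers.Conv2D(channels, 1, padding="same")(shortcut)
    for _ in range(depth):
        x = layers.Conv2D(channels, 3, padding="same", dilation_rate=dilation)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.Add()([x, shortcut])
```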

In short, with the described building blocks, we have built an architecture that is light-weight and feasible to exploit both textual and spatial relations from the chargrid embedding.

(a) The need for spatial information flow in modeling the semantic layout of a document
(b) The box convolution operation aggregates information from a distant region: the output value stored at a pixel is the mean of the pixel values covered by the learned box
Figure 2: Illustration of spatial information propagation
Figure 3: The self-attention mechanism applied to feature maps. Each pixel's feature is projected into two separate latent spaces by two linear transformations $f$ and $g$, respectively. The per-pixel correlation is then calculated, and the result is used to aggregate information from highly correlating pixels to the destination in the output map of a third linear transformation $h$. The propagation of information is cancelled when either the correlation matrix is a zero matrix or the third linear transformation maps to zero.
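The operation in Figure 3 can be sketched as a non-local block in the style of [Zhang et al.(2018)Zhang, Goodfellow, Metaxas, and Odena]. Using $1\times 1$ convolutions for the projections $f$, $g$, $h$ and an eight-fold channel reduction for the query/key space are assumptions; this is a sketch, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention2D(layers.Layer):
    """Non-local self-attention over a feature map: f and g project each pixel's
    feature into a latent space, their correlation decides how much each pixel
    aggregates from every other pixel through the value projection h."""

    def __init__(self, channels, key_channels=None, **kwargs):
        super().__init__(**kwargs)
        k = key_channels or max(channels // 8, 1)
        self.f = layers.Conv2D(k, 1)          # query projection
        self.g = layers.Conv2D(k, 1)          # key projection
        self.h = layers.Conv2D(channels, 1)   # value projection (must match input channels)
        self.channels = channels

    def call(self, x):
        shape = tf.shape(x)
        b, hgt, wdt = shape[0], shape[1], shape[2]
        q = tf.reshape(self.f(x), [b, hgt * wdt, -1])
        k = tf.reshape(self.g(x), [b, hgt * wdt, -1])
        v = tf.reshape(self.h(x), [b, hgt * wdt, self.channels])
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)  # pixel-to-pixel correlation
        out = tf.reshape(tf.matmul(attn, v), [b, hgt, wdt, self.channels])
        return x + out   # propagation vanishes if the correlation or h is zero
```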

3.3 Training specification

We specify additional details about the training stage, including the model hyper-parameters, the loss function and the data augmentation process. The hyper-parameters of the standard msau model are chosen as follows: the number of most-frequent characters $N_{char}=256$, the number of U-Net blocks $n_{blocks}=2$, the depth (number of convolution operations) of the ResBlock $res\_depth=2$, the number of channels of the first input feature map $C=16$, and the number of downsampling blocks in an encoder $n_{downsampling}=4$. msau_big has the same configuration except $n_{downsampling}=5$. The same setups are also used for our baseline U-Net variants, which contain a single encoder-decoder block. Our code, trained models and test samples will be tentatively released via https://github.com/datvo06/MSAU to foster the reproducibility of this work.

Loss function. The multi-task loss used in training is a weighted sum of two losses: the ending loss on the final output and the auxiliary loss on the intermediate output. We adopt the Focal Loss [Lin et al.(2017)Lin, Goyal, Girshick, He, and Dollar] to address the class imbalance problem during training. The auxiliary loss is computed from the ground-truth key mask, after removing all the values. The corresponding weights are $\gamma=0.4$ for the auxiliary loss and $1-\gamma$ for the ending loss.
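A minimal sketch of this multi-task loss is given below, assuming integer per-pixel label masks and per-pixel logits; the focal-loss focusing parameter (here 2.0) and the function names are assumptions, only the 0.4/0.6 weighting comes from the text.

```python
import tensorflow as tf

def focal_loss(labels, logits, focusing=2.0):
    """Per-pixel multi-class focal loss [Lin et al. 2017].
    labels: int mask (B, H, W); logits: (B, H, W, K)."""
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    p_t = tf.exp(-ce)                                 # probability of the true class
    return tf.reduce_mean(tf.pow(1.0 - p_t, focusing) * ce)

def multi_task_loss(key_mask, full_mask, aux_logits, final_logits, gamma=0.4):
    """Weighted sum of the auxiliary (keys-only) loss and the ending loss."""
    aux = focal_loss(key_mask, aux_logits)            # intermediate output vs. keys-only mask
    end = focal_loss(full_mask, final_logits)         # final output vs. full key/value mask
    return gamma * aux + (1.0 - gamma) * end
```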

Data augmentation. In order to increase the robustness of the model against varied layouts, we perform various types of data augmentation. The text-boxes in the input document are re-scaled so that the median text height is 3 pixels in the char-grid. We use the following steps for augmentation: random character replacement to mimic OCR errors, random text box shifting, random affine transformation, random background padding.
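As an example of the first augmentation step, the following sketch randomly replaces characters before the chargrid is built to mimic OCR errors; the replacement probability and the function signature are illustrative assumptions.

```python
import random

def augment_ocr_errors(textlines, charset, p=0.05):
    """Randomly replace characters in OCR text lines to mimic recognition errors.
    `textlines` is a list of (text, box) pairs; p is an assumed error rate."""
    augmented = []
    for text, box in textlines:
        noisy = "".join(random.choice(charset) if random.random() < p else ch
                        for ch in text)
        augmented.append((noisy, box))
    return augmented
```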

4 Experiments

We evaluate and analyze our approach on two customer-provided document collections: a self-collected Japanese invoice dataset and a medical receipt dataset.

4.1 Datasets

Our in-house datasets consist of challenging document images with a variety of layouts. The images are also corrupted by several distortions such as noise, blur and missing strokes due to the scanning process or poor paper quality. These conditions often appear in practical document processing pipelines, making our datasets a good benchmark for information extraction methods. Ground truth was annotated manually by human operators and includes the location and content of text boxes as well as the key information boxes to be extracted.

Japanese invoices dataset consists of 261 invoices in Japanese from several vendors. Each invoice contains 16 key-value pairs to extract. Example keys are date_issue, sender_name, receiver_name, total_amount, tax, item_name, item_amount, etc. A few sample images are shown in the supplement (Figure 5). This dataset features notable challenges such as mixed handwriting and printed text, random element placement, and low-resolution scanning.

Insurance medical receipts dataset consists of 200 medical receipts in Japanese with varied formats. Each document has 12 keys to be extracted: date_issue, billing_period, insurance_amount, patient_co-payment_ratio, surgery_fee, hospitalization_fee, etc. Examples from this dataset can also be found in the supplement (see Figure 6). The challenges of this data include long-range key-value correlations, multi-key relations, and skewed images.

Both datasets are divided into 70% for training and 30% for testing.

4.2 Implementation details

The proposed network was implemented using the TensorFlow library. We perform experiments on a server equipped with an Nvidia Quadro M4000 GPU (8GB memory). All models were trained for 200 epochs with a mini-batch size of 4. The RMSProp method [Tieleman and Hinton(2012)] is used for optimization. The learning rate is set to 0.001 at the beginning and decreases every 10 epochs with a polynomial decay of 0.9.
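One plausible reading of this schedule, sketched with Keras utilities, is a power-0.9 polynomial decay from the initial rate of 0.001 over the 200-epoch run (an alternative reading is a step-wise decay applied every 10 epochs); steps_per_epoch is a hypothetical placeholder.

```python
import tensorflow as tf

steps_per_epoch = 100                                  # placeholder, depends on the dataset size
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-3,
    decay_steps=200 * steps_per_epoch,                 # decay over the full 200-epoch run
    power=0.9)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)
```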

Baselines. In our experiments, we use two U-Net variants with dilated convolutions [Papandreou et al.(2015)Papandreou, Kokkinos, and Savalle, Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] and residual bottlenecks in the downsampling path, similar to the encoder backbone of our architecture. The first variant has $n_{downsampling}=5$, $res\_depth=2$ and $C=16$. The second variant has $n_{downsampling}=6$ and $res\_depth=3$. We abbreviate them as unet_small and unet_big, respectively. These two networks are very strong baselines due to the proven effectiveness of their sub-components. Additionally, the same architecture was also presented in [Katti et al.(2018)Katti, Reisswig, Guder, Brarda, Bickel, Höhne, and Faddoul], which allows us to make a direct comparison to one of the current state-of-the-art methods. Table 1 summarizes the number of parameters of our network and the baselines. As will be shown shortly, although our msau model uses 40% fewer parameters, it still outperforms unet_big by a large margin in terms of accuracy.

Metrics. To compare performance across models, we use three metrics: mean Intersection-over-Union (mIOU), mean pixel accuracy (pix acc) and the box F1-score (F1-score). We note that the box F1-score is a modified metric that evaluates the correctness of the predicted bounding box for each key field. The bounding box of each field is constructed by connected component analysis on the output segmentation mask. A predicted box is counted as a correct detection if its IoU with one of the ground-truth boxes is larger than a threshold of 0.8. This metric is suitable for testing the model on practical OCR-ed documents, where the ground-truth mask may slightly misalign with the char-grid generated from the OCR text.
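A simplified, single-field sketch of this metric is shown below, using connected components from SciPy; in practice the computation is repeated per key field, and the exact matching procedure here (greedy, first unmatched ground-truth box above the IoU threshold) is an assumption.

```python
import numpy as np
from scipy import ndimage

def box_f1(pred_mask, gt_boxes, iou_thresh=0.8):
    """Box F1-score for one field: predicted boxes are connected components of
    the predicted mask, matched to ground-truth (x0, y0, x1, y1) boxes at IoU >= 0.8."""
    labeled, _ = ndimage.label(pred_mask > 0)
    pred_boxes = [(s[1].start, s[0].start, s[1].stop, s[0].stop)   # (x0, y0, x1, y1)
                  for s in ndimage.find_objects(labeled)]

    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter + 1e-9)

    matched, tp = set(), 0
    for p in pred_boxes:
        hit = next((i for i, g in enumerate(gt_boxes)
                    if i not in matched and iou(p, g) >= iou_thresh), None)
        if hit is not None:
            matched.add(hit)
            tp += 1
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```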

Table 1: Number of trainable parameters

Model              Number of parameters
unet_small         6.7 × 10^5
unet_big           10.6 × 10^5
msau (ours)        6.6 × 10^5
msau_big (ours)    10.5 × 10^5
Table 2: Performance comparison between ground-truth and OCR text

Dataset     Model              F1-score (GT text)   F1-score (OCR)
Invoice     unet_big           87.3                 79.7 (-8%)
            msau_big (ours)    96.0                 91.2 (-5%)
Insurance   unet_big           89.9                 83.4 (-7%)
            msau_big (ours)    96.1                 92.4 (-4%)

4.3 Results

Overall results. We first compare the performance of our msau models to the baseline architectures. Table 3 shows the results of all models on the Japanese invoices test set. As we can see, the MSAU-nets surpass the baseline U-Net models in all metrics by a significant margin. The base msau model improves the mIOU score by 7% compared to unet_big, with only 60% of the number of parameters. msau_big further increases this gap to 9%, a remarkable 40% error reduction. This demonstrates the effectiveness and efficiency of our architectures. The F1-score is also largely improved, with a 9% increase from unet_big to msau_big and a 7% increase from unet_big to msau.

Table 3: Comparison on the Japanese invoices (left) and medical receipts (right) datasets

Japanese invoices
Model              mIOU   pix acc   F1-score
unet_small         74.4   79.2      83.2
unet_big           78.0   85.7      87.3
msau (ours)        85.5   91.1      94.6
msau_big (ours)    87.2   92.5      96.0

Medical receipts
Model              mIOU   pix acc   F1-score
unet_small         78.9   83.3      86.2
unet_big           81.7   86.5      89.9
msau (ours)        86.4   91.8      95.0
msau_big (ours)    89.1   93.3      96.1

Similar gains are observed on the insurance medical receipts dataset (Table 3). The MSAU models continue to consistently outperform the baseline U-Nets. The msau_big model yields a clear 8% improvement in mIOU over unet_big. The base msau also demonstrates good parameter efficiency by delivering better performance than unet_big (by about 5%) in both mIOU and F1-score.

We further evaluate our models on the two datasets with output text from an OCR engine (Tesseract [Smith(2007)]). As can be seen in Table 2, the msau_big model shows its robustness to OCR errors with drops of 5% and 4% in F1-score, while the baseline unet_big drops by 8% and 7%, respectively. Furthermore, the gap between unet_big and msau_big widens to roughly 9-10% on the OCR-ed documents. This shows that our method tolerates the errors produced by the OCR step to a considerable degree.

For a more comprehensive comparison, we show our model's results across different levels of synthetic OCR errors (randomly replacing/removing characters with probability $e$) in Figure 4(a). The impact of OCR errors is less severe on our MSAU-net, as the proposed model still achieves more than 70% F1-score in an extreme condition where the character error rate is as high as 25%.

(a) Performance with different OCR error rates $e$ on the Japanese Invoice dataset
(b) Performance with different train/val ratios on the Japanese Invoice dataset
Figure 4: Performance comparison in two different scenarios.

Ablation study of sub-components. In this part, we perform a step-wise decomposition of our model to study the effect of each sub-component. Experiments are conducted on the Japanese Invoice dataset, starting from the unet_big baseline and building up to msau_big. The results are shown in Table 4.

Table 4: Performance comparison with different model configurations. Abbreviations: NL - Non-local self-attention, MS - Multi-Stage learning, BC - Box Convolution, CUNet - Coupled U-Net
#   Model                        Params (×10^5)   mIOU   F1-score
1   UNet (unet_big)              10.6             78.0   87.3
2   UNet+NL                      10.5             79.2   88.1
3   UNet+NL+BC                   10.8             80.7   91.2
4   CUNet                        10.3             79.5   89.6
5   CUNet+MS                     10.3             82.3   92.5
6   CUNet+NL                     10.5             81.5   91.3
7   CUNet+MS+BC                  10.3             86.2   95.1
8   CUNet+NL+BC                  10.5             84.6   94.6
9   CUNet+NL+BC+MS (msau_big)    10.5             87.2   96.0

As we can see, the base CU-Net model (4) improves the mIOU score by 2% compared to the base U-Net (1). It also surpasses the UNet+NL (2) configuration. The non-local (self-attention) block helps the base models gain 1% and 2% mIOU for the U-Net (2) and CU-Net (6) baselines, respectively. Adding the box convolution yields a greater improvement on the CU-Net than on the base U-Net (3% mIOU from (6) to (8) versus 1% from (2) to (3)). This demonstrates the effectiveness of the component, especially when combined with the CU-Net for modeling spatial relations in the char-grid. Notably, with the multi-stage training (5) plus the box convolution, configuration (7) improves the overall mIOU by 4%. This is significant because, to reach the same number, unet_small has to grow into unet_big with 40% more parameters. The effect of the box convolution is further analyzed in Figure 5, where we can see that the improved results come from the larger receptive field obtained by expanding the box convolution filter dimensions.

Figure 5: Distribution of filter sizes in the box convolution layer. The horizontal axis shows the maximum dimension of the learned box filters.

Ablation study of multi-stage training. Table 4 shows that the multi-stage training (9) increases the mIOU by a significant 3% compared to the CUNet+NL+BC (8) configuration. This observation also holds for the base CU-Net model, with a 3% improvement from (4) to (5). We can see that the multi-stage training greatly helps performance while introducing no additional complexity to the model. One explanation is that the separation of textual and spatial feature learning via the auxiliary loss encourages each sub-component to focus on a specific task, hence better generalization. Note that the multi-stage paradigm cannot be applied to the U-Net baseline due to its single encoder-decoder design. The best CU-Net configuration with multi-stage training greatly outperforms the best U-Net model by more than 6.5% mIOU. Lastly, we found that the multi-stage training also increases the robustness of the model in limited training data scenarios. Figure 4(b) shows the impact of different train/val ratios on model performance. It can be seen that the auxiliary loss significantly boosts performance in all cases. More importantly, the improvement is more noticeable with less training data.

Qualitative result. We provide a qualitative visualization of our algorithm in Figure 6. More results are provided in the supplementary, and we encourage the reader to view them at full size on a screen.

Figure 6: Sample outputs on the Japanese invoice dataset. The segmented key mask is shown in blue and the value mask in red.

5 Conclusion

We have presented a novel deep neural network, the Multi-Stage Attentional U-Net (MSAU), for end-to-end information extraction on the 2D char-grid representation of a document. We incorporate the attention mechanism and box convolution into a multi-stage encoder-decoder architecture to handle complex textual and spatial relations in the char-grid. Moreover, we propose a multi-task training scheme to improve model performance in practically challenging scenarios. Experiments on two benchmark datasets have demonstrated the effectiveness of our approach. In future work, we wish to open-source our datasets and provide the proposed methods as new baselines to promote the currently active research in document analysis.
Acknowledgement. The authors wish to offer sincere thanks to Cinnamon AI Labs for providing the resources required to complete this work. We also express our deep gratitude to Tran Minh Quan ([email protected]) and Nghiem Nguyen Viet Dung ([email protected]) for their valuable aid in the final preparation of the paper.

References

  • [Aldavert et al.(2017)Aldavert, Rusinol, and Toledo] David Aldavert, Marcal Rusinol, and Ricardo Toledo. Automatic Static/Variable Content Separation in Administrative Document Images. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 1:87–92, 2017. ISSN 15205363.
  • [Bojanowski et al.(2016)Bojanowski, Grave, Joulin, and Mikolov] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching Word Vectors with Subword Information. Analytical Methods, 5(3):729–734, jul 2016. ISSN 17599679.
  • [Bukhari et al.(2011)Bukhari, Shafait, and Breuel] Syed Saqib Bukhari, Faisal Shafait, and Thomas M Breuel. High Performance Layout Analysis of Arabic and Urdu Document Images. Icdar, pages 1275–1279, 2011.
  • [Burkov(2018)] Egor Burkov. Deep Neural Networks using Box Convolutions. (NeurIPS):1–11, 2018.
  • [Chakrabarti et al.(2008)Chakrabarti, Kumar, and Punera] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. A graph-theoretic approach to webpage segmentation. Proceeding of the 17th international conference on World Wide Web - WWW ’08, page 377, 2008.
  • [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. pages 1–14, jun 2016.
  • [Co(2014)] Bertrand Co. Recognition of Tables and Forms. 2014. ISBN 9780857298591.
  • [Ghai and Jain(2013)] Deepika Ghai and Neelu Jain. Text Extraction from Document Images- A Review. International Journal of Computer Applications, 84(3):40–48, 2013.
  • [Hanchuan Peng et al.(2000)Hanchuan Peng, Fuhui Long, Wan-Chi Siu, Zheru Chi, and David Dagan Feng] Hanchuan Peng, Fuhui Long, Wan-Chi Siu, Zheru Chi, and David Dagan Feng. Document image matching based on component blocks. In Proceedings 2000 International Conference on Image Processing (Cat. No.00CH37101), pages 601–604 vol.2. IEEE, 2000.
  • [Hanchuan Peng et al.(2003)Hanchuan Peng, Fuhui Long, and Zheru Chi] Hanchuan Peng, Fuhui Long, and Zheru Chi. Document image recognition based on template matching of component block projections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1188–1192, sep 2003. ISSN 0162-8828.
  • [Katti et al.(2018)Katti, Reisswig, Guder, Brarda, Bickel, Höhne, and Faddoul] Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. Chargrid: Towards Understanding 2D Documents. 2018.
  • [Kise et al.(1999)Kise, Iwata, and Matsumoto] Koichi Kise, Motoi Iwata, and Keinosuke Matsumoto. On the application of Voronoi diagrams to page segmentation. … on Document Layout Interpretation …, pages 6–9, 1999. ISSN 10773142.
  • [Lin et al.(2017)Lin, Goyal, Girshick, He, and Dollar] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007. IEEE, oct 2017. ISBN 978-1-5386-1032-9.
  • [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 31, pages 3431–3440. IEEE, jun 2015. ISBN 978-1-4673-6964-0.
  • [Mao et al.(2003)Mao, Rosenfeld, and Kanungo] Song Mao, Azriel Rosenfeld, and Tapas Kanungo. Document structure analysis algorithms: a literature survey. pages 197–207, jan 2003.
  • [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. Literatura de viajes y traducción, pages 319–333, oct 2013. ISSN 10495258.
  • [Namboodiri and Jain(2007)] Anoop M. Namboodiri and Anil K. Jain. Document Structure and Layout Analysis. pages 29–48, 2007. ISSN 03029743.
  • [Oliveira et al.(2018)Oliveira, Seguin, and Kaplan] Sofia Ares Oliveira, Benoit Seguin, and Frederic Kaplan. dhSegment: A generic deep-learning approach for document segmentation. 2018.
  • [Oro and Ruffolo(2009)] Ermelinda Oro and Massimo Ruffolo. PDF-TREX: An approach for recognizing and extracting tables from PDF documents. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 906–910, 2009. ISSN 15205363.
  • [Palm et al.(2018)Palm, Laws, and Winther] Rasmus Berg Palm, Florian Laws, and Ole Winther. Attend, Copy, Parse - End-to-end information extraction from documents. dec 2018.
  • [Papandreou et al.(2015)Papandreou, Kokkinos, and Savalle] George Papandreou, Iasonas Kokkinos, and Pierre-Andre Savalle. Modeling local and global deformations in Deep Learning: Epitomic convolution, Multiple Instance Learning, and sliding window detection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 390–399. IEEE, jun 2015. ISBN 978-1-4673-6964-0.
  • [Pennington et al.(2014)Pennington, Socher, and Manning] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 37, pages 1532–1543, Stroudsburg, PA, USA, jul 2014. Association for Computational Linguistics. ISBN 9781937284961.
  • [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, 2015.
  • [Rusinol et al.(2013)Rusinol, Benkhelfallah, and Dandecy] Marcal Rusinol, Tayeb Benkhelfallah, and Vincent Poulain Dandecy. Field extraction from administrative documents by incremental structural templates. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pages 1100–1104, 2013. ISSN 15205363.
  • [Simoes, Goncalo; Galhardas, Helena; Coheur(2005)] Luisa Simoes, Goncalo; Galhardas, Helena; Coheur. Information Extraction tasks : a survey. page 13, 2005.
  • [Smith(2007)] R. Smith. An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, pages 629–633. IEEE, sep 2007. ISBN 0-7695-2822-8.
  • [Tang et al.(2018)Tang, Peng, Geng, Zhu, and Metaxas] Zhiqiang Tang, Xi Peng, Shijie Geng, Yizhe Zhu, and Dimitris N. Metaxas. CU-Net: Coupled U-Nets. aug 2018. ISSN 1053-0487.
  • [Tieleman and Hinton(2012)] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude, 2012.
  • [Wang et al.(2004)Wang, Phillips, and Haralick] Yalin Wang, Ihsin T. Phillips, and Robert M. Haralick. Table structure understanding and its performance evaluation. Pattern Recognition, 37(7):1479–1497, 2004. ISSN 00313203.
  • [Yang et al.(2017)Yang, Yumer, Asente, Kraley, Kifer, and Giles] Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. ISBN 9781538604571.
  • [Zhang et al.(2018)Zhang, Goodfellow, Metaxas, and Odena] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-Attention Generative Adversarial Networks. may 2018.