MATrIX - Modality-Aware Transformer for Information eXtraction
Abstract
We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs, presentations, or advertisements. In these, text semantics and visual information complement each other to provide a global understanding of the document. MATrIX is pre-trained in an unsupervised way with specifically designed tasks that require the use of multi-modal information (spatial, visual, or textual). We consider the spatial and text modalities all at once in a single token set. To make the attention more flexible, we use a learned modality-aware relative bias in the attention mechanism to modulate the attention between tokens of different modalities. We evaluate MATrIX on three different datasets, each with strong baselines.
1 Introduction
The worldwide number of digitized documents is in the trillions according to Adobe, the company that developed the PDF format. Some of these documents, like scanned books or reports, mostly contain pages of linear text. However, a large portion of these documents are visually rich: they contain forms or tables, or their layout itself conveys important clues about the meaning of the text. Automating form processing, filling databases, processing invoices, parsing census information, or normalizing form values across heterogeneous layouts requires performing complex information extraction tasks such as:
- Document classification.
- Sequence labeling, which consists of extracting groups of words in a document and labeling them.
- Table detection, where one tries to extract the location and structure of tabular data embedded in a document such as a utility bill or financial report.
- Form recognition, which consists of extracting key-value pairs, such as names, phone numbers, or order numbers, from documents.
To obtain good performance on these tasks, using the text modality alone is not sufficient[22], especially when the reading order is non-trivial. Transformer models relying on self-attention have become instruments of choice for modeling multi-modal input simply and efficiently, as in their simplest form they operate on sets of undifferentiated tokens. Modality fusion is therefore the crux of the problem. Previous approaches such as LayoutLM[22] simply concatenated vision and text tokens together. DocFormer[2] introduced a complex, specifically designed attention layer that fuses the text and vision modalities by summing them together, with the caveat that the text and vision sequences must have the same length. Our approach aims at striking a better trade-off between complexity and efficiency by introducing a new modality-aware relative attention mechanism that allows individual tokens to modulate the amount of attention afforded to other tokens, conditioned on their own and their target token's modality and on relative spatial information. We also introduce a novel pre-training task, line bounding box regression, to incentivize word tokens from the same line to attend to each other visually and semantically with respect to their spatial location. We depart from other works by using whole-word tokens instead of sub-word tokens for our text modality representation, using the average embedding of the sub-words of each word. This allows us to work with much longer text sequences. Another key difference is that we represent all spatial information relative to the non-padded vision input instead of using absolute pixel-coordinate embeddings, which allows us to use arbitrary resolutions at inference time.
2 Related work
Information Extraction (IE) is the task of extracting structured information from an unstructured source[17], with early approaches relying on pattern matching and natural language processing to identify information based on its position[3][4]. With the emergence of deep learning approaches in NLP[6], Document Understanding (DU) and its associated sub-tasks: key information extraction (KIE)[18][12], document layout analysis (DLA)[1][26][14], and document question answering (DQA)[16] took the forefront.
Visual Document Understanding (VDU) is a subfield of Document Understanding that combines features from text and image to extract information from structured documents. Yang et al.[24] were the first to introduce an end-to-end multi-modal approach for semantic structure labeling using a convolutional neural network and pre-trained natural language processing models. Devlin et al.[7] then introduced BERT and showcased the potential of masked language-model pre-training for NLP tasks. Lu et al.[15] introduced a co-attention mechanism that allows for the joint representation of image and text. LayoutLM[22] showed that a BERT[7] architecture combined with embeddings from Faster R-CNN could leverage bigger datasets through pre-training tasks to achieve state-of-the-art results. Shortly after, LayoutLMv2[23] improved upon LayoutLM by using a single pre-training framework and combining the visual features and the text tokens at the input of the model.
Powalski et al.[19] introduced TILT, an encoder-decoder approach inspired by T5[20], which inputs pairwise distances as a spatial bias at the decoder level.
DocFormer[2] introduced an attention layer that fuses visual, textual, and spatial features, as well as a formal definition of visual and textual feature fusion approaches.
Additionally, BROS[10] uses a BERT architecture with 2D spatial embeddings but forgoes transformer-based decoders in favour of SPADE[11] to model more complex spatial relationships.
3 Approach
3.1 Model Architecture
Modality-Aware Transformer for Information eXtraction, or MATrIX for short, is an end-to-end trained, encoder-only transformer architecture that relies on a pre-trained backbone for visual feature extraction. Following the conceptual categories introduced by Appalaraju et al.[2], our approach is a joint multi-modal architecture in which the vision and language features are concatenated before applying the attention-based fusion layers.

3.1.1 Text features
Following prior work[22][23][2], we tokenize the input words using WordPiece[21]. Given a word $w_i$, the tokenizer outputs $k$ sub-word tokens $t_{i,1}, \dots, t_{i,k}$, which are passed through an embedding layer $E$, giving representations $e_{i,1}, \dots, e_{i,k}$ that are averaged at the word level to create the encoder input text token $T_i$. Formally:

$$t_{i,1}, \dots, t_{i,k} = \mathrm{WordPiece}(w_i) \qquad (1)$$

$$e_{i,j} = E(t_{i,j}) \qquad (2)$$

$$T_i = \frac{1}{k} \sum_{j=1}^{k} e_{i,j} \qquad (3)$$
This approach allows us to train with a maximum sequence length of 512 word tokens, as opposed to 512 sub-word tokens in previous work, allowing us to process large documents with their entire text content.
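To make the word-level pooling concrete, the following is a minimal PyTorch sketch of Equations (1)-(3), assuming a pre-computed mapping from sub-word positions to word indices; module and variable names are illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn

class WordPooler(nn.Module):
    """Average sub-word (WordPiece) embeddings into one token per word.

    Minimal sketch of Eq. (1)-(3); vocabulary size and hidden width are
    assumptions, not the paper's exact configuration.
    """

    def __init__(self, vocab_size: int = 30522, dim: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, subword_ids: torch.Tensor, word_index: torch.Tensor) -> torch.Tensor:
        # subword_ids: (num_subwords,) WordPiece ids for one document
        # word_index:  (num_subwords,) index of the word each sub-word belongs to
        e = self.embed(subword_ids)                      # (num_subwords, dim)
        num_words = int(word_index.max().item()) + 1
        sums = torch.zeros(num_words, e.size(-1), device=e.device)
        sums.index_add_(0, word_index, e)                # sum sub-word embeddings per word
        counts = torch.zeros(num_words, device=e.device)
        counts.index_add_(0, word_index, torch.ones_like(word_index, dtype=torch.float))
        return sums / counts.clamp(min=1).unsqueeze(-1)  # (num_words, dim) word tokens


# Example: 5 sub-words (arbitrary ids) belonging to 3 words -> 3 word tokens
pooler = WordPooler()
T = pooler(torch.tensor([2023, 2003, 19204, 3989, 1012]),
           torch.tensor([0, 1, 2, 2, 2]))
print(T.shape)  # torch.Size([3, 768])
```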
3.1.2 Visual features
To enrich the document representation, we add visual tokens to the input. Given an image, we resize the long side to 512 pixels while preserving the aspect ratio; with a reduction ratio of 32, this gives a maximum of 16x16 visual tokens. In a departure from previous work, we do not enforce a fixed visual token sequence length and can handle arbitrary resolutions given padding to the nearest multiple of the reduction ratio. To each vision token, we add a spatial embedding based on its feature-map element size with respect to the non-padded image area. The weights of the visual spatial embedding layer are shared with the text spatial embedding.
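A minimal sketch of this image pre-processing, assuming bilinear resizing and right/bottom padding, is shown below; the backbone itself is omitted and the relative boxes are computed with respect to the non-padded area.

```python
import math
import torch
import torch.nn.functional as F

def make_visual_tokens(image: torch.Tensor, long_side: int = 512, reduction: int = 32):
    """Resize, pad, and compute relative boxes for the visual token grid.

    Sketch only: interpolation mode and padding side are assumptions.
    `image` is a float tensor of shape (C, H, W).
    """
    _, h, w = image.shape
    scale = long_side / max(h, w)                       # long side -> 512 px
    nh, nw = round(h * scale), round(w * scale)
    image = F.interpolate(image[None], size=(nh, nw), mode="bilinear",
                          align_corners=False)[0]
    # Pad to the nearest multiple of the reduction ratio (32).
    ph = math.ceil(nh / reduction) * reduction - nh
    pw = math.ceil(nw / reduction) * reduction - nw
    image = F.pad(image, (0, pw, 0, ph))                # pad right and bottom
    gh, gw = image.shape[1] // reduction, image.shape[2] // reduction
    # Relative box of each feature-map cell w.r.t. the *non-padded* area.
    ys, xs = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    boxes = torch.stack([xs * reduction / nw, ys * reduction / nh,
                         torch.full_like(xs, reduction).float() / nw,
                         torch.full_like(ys, reduction).float() / nh], dim=-1)
    return image, boxes.view(-1, 4)                     # (C, H_pad, W_pad), (gh*gw, 4)
```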
3.1.3 Spatial features
To each text and visual token, we add a spatial embedding to provide the encoder with 2D spatial information. As opposed to DocFormer, which embedded the bounding box, and LayoutLMv2, which used two embeddings, we encode the token index, the relative position of the token's top corner, and its relative width and height. The embedding itself is a 3-layer feed-forward network with leaky ReLU activations.
We use the same embedding for visual and text tokens. The intuition behind this choice is that both types of features exist in the same 2D space.
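As a minimal sketch, the shared spatial embedding can be implemented as a small MLP; the number of input features and the hidden width are assumptions.

```python
import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    """3-layer feed-forward spatial embedding with leaky ReLU activations,
    shared between text and visual tokens (sketch; sizes are assumptions)."""

    def __init__(self, in_features: int = 5, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, dim), nn.LeakyReLU(),
            nn.Linear(dim, dim), nn.LeakyReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, spatial: torch.Tensor) -> torch.Tensor:
        # spatial: (..., in_features), e.g. (idx, x0, y0, w, h) in relative coordinates
        return self.net(spatial)
```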
3.1.4 Relative Attention
Vision and text features are usually heavily spatially correlated in VDU. To include this inductive bias, LayoutLMv2[23] introduced a learnable relative attention bias, learned independently for each attention head:

$$A_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + b^{x}_{\Delta x_{ij}} + b^{y}_{\Delta y_{ij}} \qquad (4)$$

where $q_i$ and $k_j$ are the query and key vectors, $d$ is the head dimension, and $\Delta x_{ij}$, $\Delta y_{ij}$ are the bucketized relative offsets between tokens $i$ and $j$. This gives the model a simple mechanism to increase the self-attention between close-by tokens and reduce the attention to far-away tokens. We slightly modify this formulation by conditioning the relative attention bias jointly on the $x$ and the $y$ axis instead of using two independent biases:

$$A_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + b_{\Delta x_{ij}, \Delta y_{ij}} \qquad (5)$$
3.1.5 Modality-Aware Relative Attention
One of the difficult challenges for a multi-modal transformer is to build token representations that can be compared and combined across modalities through self-attention. Even if the vision and text modalities both live in the same 2D space and share a common spatial representation, this does not mean that their spatial correlations are identical. For example, vision features might help the model uniformly around a given text token, while for text, the 1D ordering of a sentence, which is over-represented in the $x$-axis relationship, is the most meaningful. We adapt the previously described relative attention bias by conditioning the learned bias on the pair-wise modalities of the tokens being considered:

$$A_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + b^{(m_i, m_j)}_{\Delta x_{ij}, \Delta y_{ij}} \qquad (6)$$

where $m_i, m_j \in \{\text{text}, \text{vision}\}$ denote the modalities of tokens $i$ and $j$.
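A minimal sketch of the modality-aware relative bias of Equations (4)-(6) follows: relative offsets are bucketized and, together with the (source, target) modality pair, index a learned bias table that is added to the pre-softmax attention logits. The bucketing scheme, table layout, and names are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ModalityAwareRelativeBias(nn.Module):
    """Learned bias b[head, m_i, m_j, Δx, Δy] added to q·k / sqrt(d)."""

    def __init__(self, num_heads: int, num_modalities: int = 2, num_buckets: int = 32):
        super().__init__()
        self.num_buckets = num_buckets
        # One (Δx, Δy) bias table per (head, source modality, target modality).
        self.table = nn.Parameter(
            torch.zeros(num_heads, num_modalities, num_modalities,
                        num_buckets, num_buckets))

    def bucketize(self, delta: torch.Tensor) -> torch.Tensor:
        # Map relative offsets in [-1, 1] to integer buckets [0, num_buckets).
        return ((delta.clamp(-1, 1) + 1) / 2 * (self.num_buckets - 1)).long()

    def forward(self, centers: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # centers:  (N, 2) relative (x, y) position of each token
        # modality: (N,)   long tensor, 0 = text, 1 = vision
        dx = self.bucketize(centers[:, 0][:, None] - centers[:, 0][None, :])  # (N, N)
        dy = self.bucketize(centers[:, 1][:, None] - centers[:, 1][None, :])  # (N, N)
        mi = modality[:, None].expand_as(dx)                                  # (N, N)
        mj = modality[None, :].expand_as(dx)
        # Gather one bias per head and token pair: (num_heads, N, N)
        return self.table[:, mi, mj, dx, dy]


# Usage: logits = q @ k.transpose(-1, -2) / d ** 0.5 + bias, before softmax.
```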
3.2 Pre-training Tasks




3.2.1 Line Regression
In this novel task, we infer the line bounding box of each input text token. Given the output of the encoder, we apply a linear projection to each text token to predict the coordinates $\hat{\ell}_i$ of its line $\ell_i$. The concept of a line is defined as returned by the OCR engine, i.e. a group of words separated by spaces. We use a mean-square error loss denoted $\mathcal{L}_{LR}$:

$$\mathcal{L}_{LR} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{\ell}_i - \ell_i \right\rVert_2^2 \qquad (7)$$

The model can achieve this objective only by learning to group words together semantically and visually and to combine their spatial information in order to output the correct line bounding box for each word in a given line.
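A minimal sketch of the line regression head and its MSE objective (Equation 7) follows; the hidden size and the (x0, y0, w, h) box parameterization are assumptions.

```python
import torch
import torch.nn as nn

class LineRegressionHead(nn.Module):
    """Predict the bounding box of the OCR line each word belongs to (L_LR)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, 4)  # linear projection to 4 box coordinates

    def forward(self, text_hidden: torch.Tensor, line_boxes: torch.Tensor) -> torch.Tensor:
        # text_hidden: (num_words, dim) encoder outputs for the text tokens
        # line_boxes:  (num_words, 4)   ground-truth relative line coordinates
        pred = self.proj(text_hidden)
        return nn.functional.mse_loss(pred, line_boxes)  # Eq. (7)
```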
3.2.2 Token Switch
In this novel VDU task, we switch text tokens in the input and use a linear projection to predict whether each token was switched. We use this task to encourage the model to learn a richer semantic understanding of the text sequence. We optimize this objective using a cross-entropy loss denoted $\mathcal{L}_{TS}$.
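A minimal sketch of the token switch corruption and its classification head follows; the switch rate and the pairwise swapping strategy are assumptions.

```python
import torch
import torch.nn as nn

def switch_tokens(tokens: torch.Tensor, p: float = 0.15):
    """Randomly swap pairs of word tokens and return binary targets (sketch)."""
    n = tokens.size(0)
    switched = torch.zeros(n, dtype=torch.long)
    idx = torch.nonzero(torch.rand(n) < p).flatten()
    idx = idx[: (idx.numel() // 2) * 2]        # keep an even number of positions
    a, b = idx[0::2], idx[1::2]
    tokens = tokens.clone()
    # Advanced indexing copies, so this pairwise swap is safe.
    tokens[a], tokens[b] = tokens[b], tokens[a]
    switched[a] = switched[b] = 1
    return tokens, switched


class TokenSwitchHead(nn.Module):
    """Binary classifier predicting whether each token was switched (L_TS)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, hidden: torch.Tensor, switched: torch.Tensor) -> torch.Tensor:
        # hidden: (num_words, dim) encoder outputs, switched: (num_words,) targets
        return nn.functional.cross_entropy(self.proj(hidden), switched)
```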
3.2.3 Multi-Modal Masked Language Modeling
In this task, we try to predict masked text tokens. We mask or replace tokens in the original sequence and use the encoder to predict the correct token at a given position. In a similar fashion to DocFormer[2], we use a modified formulation of the original BERT[7] MM-MLM in which the visual region is not masked, as opposed to prior work by Xu et al.[22][23]. Our implementation differs from DocFormer in that it accounts for the mean bag applied to the output of the tokenizer: we apply an L1 loss, denoted $\mathcal{L}_{MLM}$, between the input and the output text embeddings. However, this approach is prone to collapse, as the embeddings converge to a null vector. To prevent this side effect, we rely on the token switch task, which forces the model to converge towards a solution in which text embeddings can be differentiated from one another, thus making the collapsed solution invalid.
3.2.4 Learn To Reconstruct
In this task, introduced by Appalaraju et al.[2], we seek to reconstruct the input image from the multi-modal representation generated by the encoder. We use a simple decoder based on convolutional and up-sampling layers to produce a reconstruction $\hat{I}$ of the input image $I$. We compute the error using a smooth L1 loss denoted $\mathcal{L}_{LTR}$:

$$\mathcal{L}_{LTR} = \frac{1}{|I|} \sum_{p} \mathrm{smooth}_{L1}\!\left(\hat{I}_p - I_p\right) \qquad (8)$$

where the sum runs over pixels $p$. This task is expected to force the model to leverage textual features to reconstruct characters from the original image.
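A minimal sketch of such a reconstruction decoder follows; the channel widths, the number of up-sampling stages, and the reshaping of visual tokens into a feature map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReconstructionDecoder(nn.Module):
    """Small conv + up-sampling decoder reconstructing the input image (L_LTR)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # Five x2 up-sampling stages undo the reduction ratio of 32.
        self.net = nn.Sequential(
            nn.Conv2d(dim, 256, 3, padding=1), nn.LeakyReLU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(256, 128, 3, padding=1), nn.LeakyReLU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, 3, padding=1), nn.LeakyReLU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(64, 32, 3, padding=1), nn.LeakyReLU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(32, 3, 3, padding=1), nn.Upsample(scale_factor=2),
        )

    def forward(self, visual_tokens: torch.Tensor, grid_hw: tuple, image: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, gh*gw, dim) encoder outputs for the visual tokens
        # image:         (B, 3, 32*gh, 32*gw) the (padded) input image
        b, n, d = visual_tokens.shape
        gh, gw = grid_hw
        fmap = visual_tokens.transpose(1, 2).reshape(b, d, gh, gw)
        recon = self.net(fmap)                              # (B, 3, 32*gh, 32*gw)
        return nn.functional.smooth_l1_loss(recon, image)   # Eq. (8)
```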
3.2.5 Text Describe Image
In this task, we create mismatched image and text modality pairs and introduce them into our dataset with a 20% sampling rate. The encoder is then tasked with identifying the mismatched pairs. We add this task because it requires a global understanding of the document, as opposed to the other pre-training tasks, which rely on local features. We use a cross-entropy loss denoted $\mathcal{L}_{TDI}$.
3.2.6 Line Redaction
In this task, introduced by Xu et al.[23], we redact randomly selected lines from the input image by setting the pixels in the corresponding regions to 0. The encoder is then tasked with classifying whether a given token was redacted in the input image. With this task, we seek to instill a strong sense of the spatial relationship between text and image. We use a cross-entropy loss denoted $\mathcal{L}_{LD}$.
The final pre-training loss is a weighted sum of the individual task losses:

$$\mathcal{L} = \lambda_{LR}\,\mathcal{L}_{LR} + \lambda_{TS}\,\mathcal{L}_{TS} + \lambda_{MLM}\,\mathcal{L}_{MLM} + \lambda_{LTR}\,\mathcal{L}_{LTR} + \lambda_{TDI}\,\mathcal{L}_{TDI} + \lambda_{LD}\,\mathcal{L}_{LD} \qquad (9)$$

The coefficients prevent early divergence by over-weighting line regression and under-weighting line redaction.
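As a minimal sketch, assuming each task returns a scalar loss tensor, the combined objective of Equation (9) can be computed as a weighted sum; the placeholder weights below are not the coefficients used in the paper.

```python
import torch

# Placeholder weights: the paper over-weights line regression and
# under-weights line redaction, but the exact values are not given here.
WEIGHTS = {"lr": 1.0, "ts": 1.0, "mlm": 1.0, "ltr": 1.0, "tdi": 1.0, "ld": 1.0}

def total_loss(losses: dict) -> torch.Tensor:
    """Weighted sum of per-task losses, e.g. {"lr": l_lr, "ts": l_ts, ...}."""
    return sum(WEIGHTS[name] * value for name, value in losses.items())
```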
4 Experiments
4.1 Pre-training Experimental Setup
For pre-training, we used a subset of the IIT-CDIP document collection[13] containing 5 million scanned pages, from which we extracted the text and the word-level and line-level bounding boxes using Textract OCR.
For the visual feature extraction backbone, we use a pre-trained Twins[5] model. Chu et al. showed that vision transformers trained with spatial attention can outperform CNN-based approaches in various dense detection tasks. We use the pre-trained Twins-SVT base model and allow backpropagation to update its weights during training.
4.2 Entity Extraction Tasks
We evaluate our approach on two different datasets.
4.2.1 CORD
COnsolidated Receipt Dataset for Post-OCR Parsing[18] is a receipt dataset containing a total of 1,000 samples, with 800 for training, 100 for validation, and 100 for testing. It defines 30 classes grouped into 5 superclasses. The F1 score is reported on the 30-class labeling task.
4.2.2 FUNSD
Form Understanding in Noisy Scanned Documents[8] is a form dataset with a total of 199 samples, making it much smaller than the other datasets we evaluate. 149 samples are used for training, and the F1 score is reported on the remaining 50. We evaluate entire documents and do not truncate them to the first 512 sub-word tokens.
We report our results in Table 1. We achieve a competitive F1 score on CORD, falling 0.94 points short of the previous SOTA with a model that has about 3 times fewer parameters than LayoutLMv2-large and DocFormer-large. We do not achieve a high F1 score on FUNSD, which we attribute to its smaller number of samples.
Table 1: F1 scores on the FUNSD and CORD entity extraction tasks.

| Model | #param (M) | FUNSD | CORD |
|---|---|---|---|
| LayoutLMv1-base | 160 | 79.27 | - |
| LayoutLMv1-large | 390 | 77.89 | 94.93 |
| LayoutLMv2-base | 200 | 82.76 | 94.95 |
| TILT-base | 230 | - | 95.11 |
| LayoutLMv2-large | 426 | 84.20 | 96.01 |
| TILT-large | 780 | - | 96.33 |
| DocFormer-base | 183 | 83.34 | 96.33 |
| DocFormer-large | 533 | 84.55 | 96.99 |
| MATrIX (ours) | 166 | 78.60 | 96.05 |
4.3 Document Classification Task
This task consists of predicting a single label or class for a document page. RVL-CDIP[9] is a dataset containing 400,000 images equally split across 16 classes. 320,000 samples are used for training, with the remaining 80,000 split equally between the validation and test sets. The classification accuracy is computed on the test set. Following prior work[2][23][10], text and spatial information are extracted using Textract OCR. We do not filter on word count and evaluate the entire test set.
We report our results in Table 2.
Table 2: Document classification accuracy on RVL-CDIP.

| Model | #param (M) | Accuracy |
|---|---|---|
| TILT-base | 230 | 93.50 |
| TILT-large | 780 | 94.02 |
| LayoutLMv1-base | 160 | 94.42 |
| LayoutLMv1-large | 390 | 94.43 |
| LayoutLMv2-base | 200 | 95.25 |
| LayoutLMv2-large | 426 | 95.65 |
| DocFormer-base | 183 | 96.17 |
| DocFormer-large | 533 | 95.50 |
| MATrIX (ours) | 166 | 94.20 |
4.4 Ablation Study
We conduct an extensive ablation study using the CORD dataset.
4.4.1 Impact of modality-aware relative attention
We conduct an ablation study to determine the impact, on the CORD downstream task, of initializing the attention layers and sub-word token embeddings with pre-trained BERT weights, and of using modality-aware relative attention. Table 3 shows that modality-aware relative attention offers a significant improvement over regular multi-modal self-attention.
Table 3: Impact of BERT initialization and modality-aware relative attention.

| Approach | CORD (F1) |
|---|---|
| Base | 95.05 |
| Base + BERT | 95.19 (+0.14) |
| Base + MATrIX | 95.48 (+0.43) |
| Base + BERT + MATrIX | 96.05 (+1.00) |
4.4.2 Impact of pre-training tasks
We conduct an ablation study to determine the impact of each pre-training task on the final results for the CORD downstream task. To minimize resource usage, these pre-trainings ran for only a single epoch on the 5M-document dataset. In Table 4, MM-MLM was always trained with the token switch task to prevent collapse. Appalaraju et al.[2] showed that the learn-to-reconstruct and text-describe-image tasks were beneficial for this task; we therefore attribute the regression observed when adding them to insufficient training.
Table 4: Impact of pre-training tasks.

| Pre-training task | CORD (F1) |
|---|---|
| MM-MLM* | 94.52 |
| + Learn to Reconstruct + Describe Image | 93.83 (-0.69) |
| + Line Regression | 95.55 (+1.72) |
| + Line Redaction | 95.31 (-0.24) |
4.4.3 Impact of modalities
We conduct an ablation study to analyze the impact of each modality on the final results for the CORD downstream task. We do no additional pre-training and simply input zero values for the missing modalities during the fine-tuning step on CORD. Table 5 shows that the additional modalities are being leveraged.
Table 5: Impact of modalities.

| Modality | CORD (F1) |
|---|---|
| Text-only | 82.18 |
| + spatial relative attention bias | 93.80 (+11.62) |
| + spatial feature embeddings | 95.38 (+1.58) |
| + visual features | 96.05 (+0.67) |
4.4.4 Impact of input resolution
We conduct an ablation study to measure the impact of increasing and decreasing the input resolution during fine-tuning, given an encoder pre-trained at a maximum resolution of 512x512. We observe that the model is indeed resilient to resolution changes.
Table 6: Impact of input resolution.

| Input resolution | Maximum visual tokens | CORD (F1) |
|---|---|---|
| 256 | 64 (8x8) | 95.83 |
| 512 | 256 (16x16) | 96.05 |
| 768 | 576 (24x24) | 96.05 |
5 Conclusion
In this work, we introduced MATrIX, a multi-modal transformer for information extraction in the visual document understanding field. We presented a novel yet simple modality-aware relative attention mechanism, demonstrated that it is superior to the simpler relative attention, and reached competitive results on three datasets. Additionally, we showed that using word-level tokens is a practical way of working with longer text sequences. We introduced line regression and token switch, two new cross-modality pre-training tasks which were shown to improve the F1 score on the CORD dataset.
In future work, we will explore MATrIX's suitability as a backbone for more complex tasks such as table detection, table structure extraction, and entity-relationship prediction.
6 Supplemental
6.1 Implementation details
The pre-training was done on 8x V100 GPUs with 32GB of memory using the LAMB[25] optimizer, which was shown to significantly decrease the training time of transformer architectures. We used the implementation available in the NVIDIA Apex library. The fine-tuning for the FUNSD and CORD downstream tasks was done on a single V100 with 16GB of memory.
| Hyper-parameter | Pre-training | Fine-tuning (CORD) | Fine-tuning (FUNSD) | Fine-tuning (RVL-CDIP) |
|---|---|---|---|---|
| Epochs | 5 | 100 | 100 | 5 |
| Learning rate | 0.0001 | 0.0001 | 0.00005 | 0.0001 |
| Warm-up | 60 000 | 1000 | 1000 | 60 000 |
| Gradient clipping | 2.0 | 4.0 | 4.0 | 2.0 |
| Optimizer | FusedLAMB | AdamW | AdamW | AdamW |
| Lower case | Yes | Yes | Yes | Yes |
| Sequence length | 512 | 512 | 512 | 512 |
| Encoder layers | 12 | 12 | 12 | 12 |
| Batch size | 8 | 4 | 1 | 8 |
| GPU | V100 (32GB) | V100 (16GB) | V100 (16GB) | V100 (32GB) |
References
- [1] Apostolos Antonacopoulos, David Bridson, Christos Papadopoulos, and Stefan Pletschacher. A realistic dataset for performance evaluation of document layout analysis. 2009 10th International Conference on Document Analysis and Recognition, pages 296–300, 2009.
- [2] Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. Docformer: End-to-end transformer for document understanding. CoRR, abs/2106.11539, 2021.
- [3] Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-match rules for information extraction. In CoNLL97: Computational Natural Language Learning, 1997.
- [4] Claire Cardie. Empirical methods in information extraction. AI Magazine, 18(4):65, Dec. 1997.
- [5] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting spatial attention design in vision transformers. CoRR, abs/2104.13840, 2021.
- [6] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 160–167, New York, NY, USA, 2008. Association for Computing Machinery.
- [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
- [8] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. In ICDAR-OST, 2019.
- [9] Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR), 2015.
- [10] Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, and Sungrae Park. BROS: A layout-aware pre-trained language model for understanding documents. CoRR, abs/2108.04539, 2021.
- [11] Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Sohee Yang, and Minjoon Seo. Spatial dependency parsing for 2d document understanding. CoRR, abs/2005.00642, 2020.
- [12] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. FUNSD: A dataset for form understanding in noisy scanned documents. CoRR, abs/1905.13538, 2019.
- [13] David D. Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, David A. Grossman, and Jefferson Heard. Building a test collection for complex document information processing. In SIGIR, pages 665–666, 2006.
- [14] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A benchmark dataset for document layout analysis. CoRR, abs/2006.01038, 2020.
- [15] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. CoRR, abs/1908.02265, 2019.
- [16] Minesh Mathew, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar. Docvqa: A dataset for VQA on document images. CoRR, abs/2007.00398, 2020.
- [17] Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh. A survey on open information extraction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3866–3878, Santa Fe, New Mexico, USA, Aug. 2018. Association for Computational Linguistics.
- [18] Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. Cord: A consolidated receipt dataset for post-ocr parsing. 2019.
- [19] Rafal Powalski, Lukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michal Pietruszka, and Gabriela Palka. Going full-tilt boogie on document understanding with text-image-layout transformer. CoRR, abs/2102.09550, 2021.
- [20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019.
- [21] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
- [22] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. Layoutlm: Pre-training of text and layout for document image understanding. CoRR, abs/1912.13318, 2019.
- [23] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei A. F. Florêncio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. CoRR, abs/2012.14740, 2020.
- [24] Xiao Yang, Mehmet Ersin Yümer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural network. CoRR, abs/1706.02337, 2017.
- [25] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.
- [26] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. Publaynet: largest dataset ever for document layout analysis. CoRR, abs/1908.07836, 2019.