TabIQA: Table Questions Answering on
Business Document Images
Abstract
Answering questions over tables in business documents poses many challenges that require understanding tabular structures, cross-document referencing, and additional numeric computations beyond simple search queries. This paper introduces a novel pipeline, named TabIQA, to answer questions about business document images. TabIQA combines state-of-the-art deep learning techniques 1) to extract table content and structural information from images and 2) to answer various questions related to numerical data, text-based information, and complex queries from structured tables. The evaluation results on the VQAonBD 2023 dataset demonstrate the effectiveness of TabIQA in achieving promising performance in answering table-related questions. The TabIQA repository is available at https://github.com/phucty/itabqa.
Index Terms:
Visual Question Answering, Table Question Answering, Business Documents
I Introduction
Businesses generate and process vast amounts of information, and extracting valuable insights from this data is crucial for making informed decisions. Business documents, such as financial reports, invoices, and contracts, often contain valuable information in tabular form. However, answering questions based on these document images can be challenging due to their complex structures, cross-referencing, and numerical computations beyond simple search queries.
Traditional information retrieval methods, such as keyword search and regular expressions, are not always effective in retrieving information from tables in business documents. Therefore, there is a growing need for automated approaches that can accurately extract relevant information from tables and answer various questions. Such approaches can save businesses time and effort and improve the accuracy and reliability of the extracted information.

The main objective of this paper is to introduce TabIQA, a novel pipeline for answering questions on business document images. Figure 1 illustrates the question-answering task from a business document image in the VQAonBD 2023 dataset. Given a business document image and a question about the image, “What is the dollar value (in thousands) of foreign currency translation for the year 2013?”, the output answer is “-37619”. TabIQA uses the table recognition module to extract the table structure information and the text content of each table cell and converts them into HTML format. Subsequently, the post-structure extraction module identifies the headers, data cells, and hierarchical structure of the table. Once the table is structured, it is converted to a dataframe for further processing. The question-answering module processes the input question and the table dataframe with an encoder and generates the final answer with a decoder.
Overall, this study makes the following contributions:
• Introducing TabIQA, a novel pipeline for answering questions about business document images: TabIQA is a comprehensive pipeline combining state-of-the-art deep learning techniques to extract relevant information from tables and answer various questions related to numerical data, text-based information, and complex queries.
• Providing a publicly available repository: We have made the TabIQA repository publicly available to encourage the reproducibility of our results and enable other researchers to use and build upon our work.
• Demonstrating the effectiveness of TabIQA on the VQAonBD 2023 dataset: The evaluation results on the VQAonBD 2023 dataset (https://ilocr.iiit.ac.in/vqabd/dataset.html) demonstrate the effectiveness of TabIQA in achieving promising performance in answering table-related questions.
The rest of the paper is structured as follows. Section II summarizes related work on question answering over business document images. We introduce the TabIQA method in Section III. Section IV presents the experimental settings and results. Finally, in Section V, we present conclusions and discuss future directions for table question answering on business document images.
II Related Work
Image-based table recognition is an important component of document understanding and table question answering systems. It aims to recognize the table structure and the text content of each table cell from an input image and represent them in a machine-readable format (HTML or CSV). Most previous works on table recognition [1, 2, 3, 4] focused on two-step approaches that divide the problem into two sub-problems, table structure recognition and table cell content recognition, and solve each sub-problem independently with a separate system. In recent years, owing to advances in deep learning and the availability of large-scale table image datasets, several works [5, 6, 7, 8] have explored end-to-end approaches that solve the table recognition problem with a single system. Ly et al. [7] formulated table recognition as a multi-task learning problem and proposed an end-to-end multi-task learning model for image-based table recognition, which consists of three separate decoders for the three sub-tasks of table recognition: table structure recognition, cell detection, and cell-content recognition. The proposed model achieves state-of-the-art accuracy on the PubTabNet and FinTabNet datasets. Ly et al. [8] also proposed an end-to-end weakly supervised learning model, named WSTabNet, which requires only table images and their HTML annotations for training. WSTabNet achieves competitive accuracy compared with fully supervised and two-step learning methods.
Information Retrieval from Business Documents. Business documents such as invoices, receipts, and financial statements contain valuable information critical for decision-making and analysis. Traditional information retrieval methods from business documents rely on manual data entry or simple search queries, which can be time-consuming and error-prone. Recent advances in deep learning and natural language processing have enabled the development of automated systems that can extract information from business documents with high accuracy and efficiency.
Deep Learning Techniques for Table Extraction and Question Answering. Table extraction and question answering are critical tasks in automated information retrieval from business documents. Deep learning techniques have shown great promise in addressing these challenges. Various approaches have been proposed for table extraction, including region-based, cell-based, and structure-based methods. For question answering, neural network-based models such as transformer-based models have achieved state-of-the-art results on various datasets.
While these studies have achieved promising results in automated information retrieval from business documents, there is still room for improvement in accuracy, efficiency, and scalability. This paper proposes TabIQA, a novel pipeline for answering questions about business document tables that leverages state-of-the-art deep learning techniques for improved performance.
III Approach
This section describes TabIQA’s overall framework in Section III-A. The details of the table recognition module, the post-structure extraction module, and the question-answering module are described in Sections III-B, III-C, and III-D, respectively.
III-A Framework
The overall framework of TabIQA, a system designed for question-answering using table images in business documents, is illustrated in Fig. 2. TabIQA utilizes a table recognition algorithm that extracts the table’s structure and textual content of each cell and then converts them into HTML format. The system subsequently analyzes the HTML table to identify headers, data cells, and hierarchical structure and transforms it into a dataframe for further processing. Finally, the question-answering module processes the input question and the table dataframe with an encoder and generates the final answer through a decoder.
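The data flow can be sketched as follows. This is a minimal illustration only: the function names (`recognize_table`, `extract_structure`, `answer_question`) are hypothetical placeholders for the three modules and do not correspond to the actual API of the TabIQA repository.

```python
from typing import Callable

import pandas as pd


def tabiqa_pipeline(
    image_path: str,
    question: str,
    recognize_table: Callable[[str], str],                # image path -> HTML table
    extract_structure: Callable[[str], pd.DataFrame],     # HTML table -> structured dataframe
    answer_question: Callable[[str, pd.DataFrame], str],  # question + dataframe -> answer
) -> str:
    """Illustrative end-to-end data flow of TabIQA (hypothetical interfaces)."""
    html_table = recognize_table(image_path)     # 1) table recognition module
    table_df = extract_structure(html_table)     # 2) post-structure extraction module
    return answer_question(question, table_df)   # 3) question-answering module
```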

III-B Table Recognition
This module aims to predict the table structure information and the text content of each table cell from a table image and represent them in a machine-readable format (HTML). This module consists of one shared encoder, one shared decoder, and three separate decoders for three sub-tasks of table recognition: table structure recognition, cell detection, and cell-content recognition.
First, we trained this model on the training set of VQAonBD 2023 and validated it on the validation set for model selection and hyperparameter tuning. Finally, we used both the training and validation sets to train the final table recognition module.
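For concreteness, the sketch below shows one simple way the predicted structure tokens and recognized cell texts could be merged into an HTML string. It assumes the PubTabNet-style token convention used in [6, 7]; the merging logic is an illustration of the output format, not the module’s exact implementation.

```python
def merge_to_html(structure_tokens, cell_texts):
    """Merge predicted structure tokens and recognized cell contents into HTML.

    Assumes PubTabNet-style tokens, where a cell is opened either by a plain
    '<td>' token or by a '<td' ... '>' span group, and cell_texts lists the
    recognized text of each cell in reading order.
    """
    cells = iter(cell_texts)
    pieces = []
    for token in structure_tokens:
        pieces.append(token)
        if token in ("<td>", ">"):          # a cell has just been opened
            pieces.append(next(cells, ""))  # insert its recognized content
    return "<table>" + "".join(pieces) + "</table>"


# Example: one row whose first cell spans two columns.
tokens = ["<tr>", "<td", ' colspan="2"', ">", "</td>", "<td>", "</td>", "</tr>"]
print(merge_to_html(tokens, ["Foreign currency translation", "-37619"]))
```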
III-C Post Structure Extraction
The table post-structure extraction module plays a crucial role in the TabIQA system. The module’s primary function is to predict table headers and extract the hierarchical rows from the HTML table.
III-C1 Header Prediction
To predict table headers, the module uses a set of heuristics based on the characteristics of the input table. Specifically, a header row is identified as one of the first table rows that satisfies one or more of the following conditions (a minimal sketch of these heuristics is given after the list):
• Column spans: the row contains cells that span multiple columns.
• NaN cells: the row contains cells with missing values.
• Duplicate value cells: the row contains cells with identical values in the same row.
If no header row is found using these heuristics, the first row is treated as the table header. All the remaining rows are then classified as data rows.
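The following is a minimal sketch of these header heuristics, assuming the HTML table has already been parsed into a pandas DataFrame of cell strings together with per-row column-span counts. The function name, signature, and the cut-off on how many leading rows are inspected are illustrative assumptions, not the repository’s API.

```python
import pandas as pd


def predict_header_rows(df: pd.DataFrame, colspans, max_scan: int = 3):
    """Return indices of the leading rows that look like header rows.

    df       : raw table as a DataFrame of strings (one cell per entry).
    colspans : per-row lists with the column-span count of each cell (from the HTML).
    max_scan : assumed cut-off on how many leading rows are inspected.
    """
    header_rows = []
    for i in range(min(max_scan, len(df))):
        row = df.iloc[i]
        has_colspan = any(span > 1 for span in colspans[i])   # cells spanning several columns
        has_nan = row.isna().any() or (row == "").any()       # missing (NaN/empty) cells
        has_duplicates = row.dropna().duplicated().any()      # identical values in the same row
        if has_colspan or has_nan or has_duplicates:
            header_rows.append(i)
        else:
            break  # the first row that looks like a data row ends the header block
    return header_rows or [0]  # fall back to treating the first row as the header
```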
III-C2 Hierarchical Row Prediction
This module predicts the hierarchical information from the table HTML and then concatenates the value of each hierarchical cell to the lower-level cells in the same column. In this work, we propose two hierarchical row prediction algorithms: the first is based on the predicted table HTML alone, while the second uses both the predicted table HTML and the table cell bounding boxes. The two algorithms are defined in Algorithms 1 and 2, respectively.
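As an illustration of the HTML-only variant, the sketch below treats a data row whose first cell is filled while all remaining cells are empty as a hierarchical (section) row and prepends its label to the first-column values of the rows beneath it. This is a reconstruction under that assumption, not a reproduction of Algorithm 1.

```python
import pandas as pd


def flatten_hierarchical_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate the label of each hierarchical row to the lower-level rows below it."""
    out = df.copy()
    prefix, keep = "", []
    for i, row in df.iterrows():
        others_empty = row.iloc[1:].replace("", pd.NA).isna().all()
        if others_empty and str(row.iloc[0]).strip():
            prefix = str(row.iloc[0]).strip()   # a new section label starts here
            continue                            # the section row itself is dropped
        if prefix:
            out.at[i, df.columns[0]] = f"{prefix} - {row.iloc[0]}"
        keep.append(i)
    return out.loc[keep].reset_index(drop=True)
```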
III-D Question Answering
We adopt the state-of-the-art table-based QA model OmniTab [9], which builds on the TAPEX [10] pre-training setting. It feeds the concatenated token sequence of the natural language question and the linearized table dataframe into a bidirectional encoder. The table dataframe is linearized in top-to-bottom, left-to-right order. The final answer is generated with an autoregressive decoder.
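A minimal inference sketch with the Hugging Face transformers library is shown below. It assumes the publicly released `neulab/omnitab-large-finetuned-wtq` checkpoint and a toy table; TabIQA itself feeds the dataframe produced by the post-structure extraction module into an OmniTab-large model fine-tuned on VQAonBD 2023 data, as described in the next paragraph.

```python
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint: OmniTab-large fine-tuned on WikiTableQuestions.
tokenizer = AutoTokenizer.from_pretrained("neulab/omnitab-large-finetuned-wtq")
model = AutoModelForSeq2SeqLM.from_pretrained("neulab/omnitab-large-finetuned-wtq")

# Toy table dataframe standing in for the post-structure extraction output.
table = pd.DataFrame(
    {"item": ["Foreign currency translation", "Net income"],
     "2013": ["-37619", "120450"]}
).astype(str)  # the tokenizer expects string-valued cells

question = ("What is the dollar value (in thousands) of foreign currency "
            "translation for the year 2013?")

# The tokenizer linearizes the table row by row (top-to-bottom, left-to-right)
# and concatenates it with the question before encoding.
encoding = tokenizer(table=table, query=question, return_tensors="pt")
answer_ids = model.generate(**encoding, max_new_tokens=32)
print(tokenizer.batch_decode(answer_ids, skip_special_tokens=True))
```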
We create a new fine-tuning training set from the training set of VQAonBD 2023 and the outputs of the table recognition and post-structure extraction modules. Each sample consists of a table dataframe, a natural language question, and the corresponding ground-truth answer. We fine-tuned the OmniTab-large pre-trained model on 100K samples drawn from the training set of VQAonBD 2023, with 20K samples from each question category, as sketched below.
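The sketch below illustrates how such fine-tuning samples could be assembled. The column names (`table_id`, `category`, `question`, `answer`) and the per-category sampling are assumptions about the data layout made for illustration; the repository’s actual preprocessing may differ.

```python
import pandas as pd


def build_finetuning_samples(questions: pd.DataFrame, tables: dict,
                             per_category: int = 20_000, seed: int = 0):
    """Pair each sampled question with its structured table dataframe and gold answer.

    questions : DataFrame with columns ['table_id', 'category', 'question', 'answer'].
    tables    : mapping table_id -> dataframe produced by the post-structure module.
    """
    samples = []
    for _, group in questions.groupby("category"):
        picked = group.sample(n=min(per_category, len(group)), random_state=seed)
        for _, row in picked.iterrows():
            samples.append({
                "table": tables[row["table_id"]],
                "question": row["question"],
                "answer": row["answer"],
            })
    return samples
```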
IV Experiments
IV-A Dataset
We evaluate TabIQA on the VQAonBD 2023 dataset. This dataset contains document images from the FinTabNet dataset [11] and relevant questions about these document images. The ground truth of the training and validation sets also contains table structure annotation information, i.e., bounding boxes of words, tokens, digitized text, and row and column identifiers.
Each document image may include up to 50 questions across five categories, together with their corresponding answers. The number of questions per category for a single table varies with its content and format: 0 to 25 for Category 1, 0 to 10 for Category 2, 0 to 3 for Category 3, 0 to 7 for Category 4, and 0 to 5 for Category 5.
The detailed statistics of VQAonBD 2023 are described in the following sections.
IV-A1 Document Images
In this section, we analyze the document images of the VQAonBD 2023 dataset. The statistics of document images are reported in Table I. “Doc Images” is the number of document images, whereas “Blank Images” is the number of blank pages. For example, the sample “val_table_image_9684__CL__2014__page_54_split_0” in the validation set of VQAonBD 2023 is a blank page.
The training set contains about 12% blank images, whereas the validation and test sets contain less than 1%. Despite the absence of content, questions are still associated with these blank images. During the TabIQA training phase, we exclude the samples containing blank images. In the testing phase, TabIQA returns a zero value for samples containing blank images.
Table I: Statistics of document images in the VQAonBD 2023 dataset.

| | Doc Images | Blank Images |
|---|---|---|
| Train | 39,999 | 5,025 |
| Validation | 4,535 | 6 |
| Test | 4,361 | 13 |
IV-A2 Questions
The dataset used for training, validation, and testing contains 41,465, 1,254,165, and 135,825 questions, respectively. The average question length is 109.45 characters, and the average number of words in a question is 10.5. Some questions are longer than 1,500 characters. To identify named entities within the questions, we use the spaCy tool [12], which detects an average of 1.71 entities per question. Of these, 1.42 entities pertain to the time dimension, 0.18 are numerical values, and 0.12 are textual values.
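For reference, entity statistics of this kind can be obtained along the following lines with spaCy [12]; the choice of the `en_core_web_sm` pipeline and the grouping of entity labels into time, numerical, and textual values are assumptions.

```python
from collections import Counter

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

question = ("What is the dollar value (in thousands) of foreign currency "
            "translation for the year 2013?")
doc = nlp(question)

# Group entity labels: DATE/TIME -> time dimension, number-like labels -> numerical,
# everything else -> textual.
counts = Counter()
for ent in doc.ents:
    if ent.label_ in {"DATE", "TIME"}:
        counts["time"] += 1
    elif ent.label_ in {"CARDINAL", "MONEY", "QUANTITY", "PERCENT", "ORDINAL"}:
        counts["numerical"] += 1
    else:
        counts["textual"] += 1
print([(ent.text, ent.label_) for ent in doc.ents], dict(counts))
```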
IV-A3 Tables
Table II reports size statistics of the annotated tables in the training and validation sets of VQAonBD 2023. The training set contains tables with more rows than the validation set, and its cells are also longer on average.
Table II: Size statistics, min-max (average), of annotated tables in the VQAonBD 2023 training and validation sets.

| | Train (avg.) | Val (avg.) |
|---|---|---|
| Rows | 2-77 (13.27) | 2-58 (12.15) |
| Columns | 2-16 (4.57) | 2-13 (4.44) |
| Cell length | 3.2-161.08 (11.3) | 4.56-40.54 (11.17) |
IV-B Experiment Setup
IV-B1 Baselines
We compare TabIQA against publicly available table question answering models fine-tuned on the WikiTableQuestions dataset [15], namely TAPAS [13], TAPEX [10], and OmniTab [9], together with a Zero baseline that always returns zero for any question. These baselines are described further in Section IV-C.
IV-B2 Metrics
The VQAonBD 2023 performance metric depends on the answer type. For textual answers, the Averaged Normalised Levenshtein Similarity (ANLS) used in DocVQA [14] is applied; ANLS is designed to respond softly to answer mismatches that may arise from OCR imperfections. For numerical answers, the metric is computed as a scaled Euclidean norm of the ANLS score and the percentage of absolute difference between the predicted answer and the ground-truth answer.
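For textual answers, the ANLS component can be sketched as follows, using the DocVQA definition with threshold τ = 0.5; the scaled combination applied to numerical answers is not reproduced here, and the helper below is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(prediction: str, gold: str, tau: float = 0.5) -> float:
    """Per-answer Normalised Levenshtein Similarity with the DocVQA cut-off."""
    p, g = prediction.strip().lower(), gold.strip().lower()
    if not p and not g:
        return 1.0
    nl = levenshtein(p, g) / max(len(p), len(g))
    return 1.0 - nl if nl < tau else 0.0
```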
IV-C Results and Discussions
Table III compares TabIQA’s question-answering performance against other baseline models on the VQAonBD 2023 dataset. We use the fine-tuned Hugging Face models (https://huggingface.co/models?pipeline_tag=table-question-answering) of TAPAS [13], TAPEX [10], and OmniTab [9] on the WikiTableQuestions dataset [15]. In addition, we include the Zero setting, a model that always returns zero for any question. TabIQA1 represents the setting where the question-answering model is fine-tuned directly on raw HTML tables. On the other hand, TabIQA2 refers to the setting where the QA model is fine-tuned on structured tables.
Table III: Question-answering performance on the VQAonBD 2023 dataset.

| Model | VQAonBD 2023 Score |
|---|---|
| TAPAS [13] | 0.4138 |
| TAPEX [10] | 0.4390 |
| OmniTab [9] | 0.4421 |
| Zero | 0.2616 |
| TabIQA1 | 0.8808 |
| TabIQA2 | 0.8997 |
Regarding question-answering performance, the scores indicate that TabIQA1 and TabIQA2 significantly outperform the other baseline models, i.e., TAPAS, TAPEX, OmniTab, and the Zero model. These results suggest that fine-tuning the question-answering model on raw HTML tables or on structured tables extracted by TabIQA significantly improves performance compared to the baseline models. TabIQA2 also outperforms TabIQA1, which suggests that fine-tuning the QA model on structured tables leads to better performance than fine-tuning on raw HTML tables. Overall, these results demonstrate the effectiveness of TabIQA in achieving high accuracy in answering table-related questions.
V Conclusion and Future Work
This paper presented a new pipeline, TabIQA, to answer questions related to business document images. TabIQA employs cutting-edge deep learning methods in two stages. First, it extracts both the content and the structural information from table images. Second, it uses these features to answer questions about numerical data, text-based information, and complex queries over structured tables. Experimental results on the VQAonBD 2023 dataset demonstrate that TabIQA achieves promising performance in answering questions about tables.
We plan to extend the TabIQA pipeline to handle more complex queries that require reasoning over multiple tables or information from the document’s non-tabular parts. Another area for future work is to investigate the generalization capabilities of TabIQA to handle tables from different domains or document layouts. These are all potential avenues for future research that could enhance the capabilities and performance of TabIQA in real-world scenarios.
Acknowledgements
The research was supported by the Cross-ministerial Strategic Innovation Promotion Program (SIP) Second Phase, “Big-data and AI-enabled Cyberspace Technologies” by the New Energy and Industrial Technology Development Organization (NEDO).
References
- [1] L. Qiao, Z. Li, Z. Cheng, P. Zhang, S. Pu, Y. Niu, W. Ren, W. Tan, and F. Wu, “LGPMA: Complicated table structure recognition with local and global pyramid mask alignment,” 2021 International Conference on Document Analysis and Recognition (ICDAR), pp. 99–114, 2021.
- [2] J. Ye, X. Qi, Y. He, Y. Chen, D. Gu, P. Gao, and R. Xiao, “PingAn-VCGroup’s solution for ICDAR 2021 competition on scientific literature parsing task B: Table recognition to HTML,” arXiv preprint arXiv:2105.01848, 2021.
- [3] A. Nassar, N. Livathinos, M. Lysak, and P. Staar, “TableFormer: Table structure understanding with transformers,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [4] Z. Zhang, J. Zhang, J. Du, and F. Wang, “Split, embed and merge: An accurate table structure recognizer,” Pattern Recognition, vol. 126, 2022.
- [5] Y. Deng, D. Rosenberg, and G. Mann, “Challenges in end-to-end neural scientific table recognition,” 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 894–901, 2019.
- [6] X. Zhong, E. ShafieiBavani, and A. Jimeno Yepes, “Image-based table recognition: Data, model, and evaluation,” 2020 European Conference on Computer Vision (ECCV), p. 564–580, 2020.
- [7] N. T. Ly and A. Takasu, “An end-to-end multi-task learning model for image-based table recognition,” in Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP 2023, pp. 626–634.
- [8] N. T. Ly, A. Takasu, P. Nguyen, and H. Takeda, “Rethinking image-based table recognition using weakly supervised methods,” in Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods - ICPRAM 2023, pp. 872–880.
- [9] Z. Jiang, Y. Mao, P. He, G. Neubig, and W. Chen, “OmniTab: Pretraining with natural and synthetic data for few-shot table-based question answering,” in NAACL 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, Eds., pp. 932–942.
- [10] Q. Liu, B. Chen, J. Guo, M. Ziyadi, Z. Lin, W. Chen, and J. Lou, “TAPEX: table pre-training via learning a neural SQL executor,” in ICLR 2022.
- [11] X. Zheng, D. Burdick, L. Popa, P. Zhong, and N. X. R. Wang, “Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context,” Winter Conference for Applications in Computer Vision (WACV), 2021.
- [12] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, “spaCy: Industrial-strength Natural Language Processing in Python,” 2020.
- [13] J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, and J. M. Eisenschlos, “TaPas: weakly supervised table parsing via pre-training,” in ACL 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds., pp. 4320–4333.
- [14] M. Mathew, D. Karatzas, and C. V. Jawahar, “DocVQA: A dataset for VQA on document images,” in WACV 2021, pp. 2199–2208.
- [15] P. Pasupat and P. Liang, “Compositional semantic parsing on semi-structured tables,” in ACL 2015, pp. 1470–1480.