
Jaeger: A Concatenation-Based Multi-Transformer VQA Model

Jieting Long, The University of Sydney, Sydney, NSW, Australia 2050, [email protected]
Zewei Shi, The University of Sydney, Sydney, NSW, Australia 2050, [email protected]
Penghao Jiang, The University of Sydney, Sydney, NSW, Australia 2050, [email protected]
Yidong Gan, The University of Sydney, Sydney, NSW, Australia 2050, [email protected]
(2023)
Abstract.

Document-based Visual Question Answering poses a challenging task at the intersection of linguistic sense disambiguation and fine-grained multimodal retrieval. Although there has been encouraging progress in document-based question answering due to the use of large language models and open-world prior models (Cheng et al., 2023), several challenges persist, including prolonged response and inference times and imprecise matching. To overcome these challenges, we propose Jaeger, a concatenation-based multi-transformer VQA model. To derive question features, we leverage the strong capabilities of RoBERTa-large (Liu et al., 2019) and GPT2-xl (Radford et al., 2019) as feature extractors and concatenate the outputs of both models. This operation allows the model to consider information from diverse sources concurrently, strengthening its representational capability. By leveraging multiple pre-trained models for feature extraction, our approach has the potential to amplify their individual strengths through concatenation. After concatenation, we apply dimensionality reduction to the output features, reducing the model’s computational overhead and inference time. Empirical results demonstrate that our proposed model achieves competitive performance on Task C of the PDF-VQA dataset.

Keywords: Document-based Visual Question Answering, Large Language Model, Concatenation Operation.
Journal year: 2023. Copyright: ACM licensed. Conference/booktitle: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (DocIU Workshop, CIKM ’23), October 21–25, 2023, Birmingham, United Kingdom. DOI: https://doi.org/XXXXXX.XXXX. CCS concepts: Information systems → Information retrieval; Computing methodologies → Artificial intelligence.

1. Introduction

Document-based Visual Question Answering has various applications in biomedical science, business, and education. However, the accuracy of answers and the inference time have always been crucial concerns. To address this challenge, we leverage the robust feature extraction capabilities of large language models to identify correct answers precisely. We aim to develop a document-based visual question answering model that relies on large language models while minimizing inference time and computational resource demands.

Previous studies have investigated different approaches to Document-based Visual Question Answering, most of which used CNN and LSTM models to answer questions about image content (Antol et al., 2015). However, these models struggled with complex relationships and contextual information, resulting in below-average performance on intricate questions. As attention mechanisms advanced, more studies incorporated them into models to improve interpretability and performance, allowing a model to automatically focus on question-relevant areas and better handle complex queries (Yang et al., 2016). However, this also significantly increases model complexity and computational requirements. Subsequently, the emergence of multi-modal models introduced a new research direction. Multi-modal models can capture intricate relationships between images and text, often outperforming single-modal models (Li et al., 2020). Nonetheless, most multi-modal models require substantial amounts of training data to avoid overfitting; when abundant data is unavailable, they may have limited applicability to general downstream tasks.

We summarize our contributions as follows:

  • We introduce Jaeger, a novel model for document-based visual question answering.

  • We design a concatenation strategy to enhance the model’s performance and representational capability.

  • We demonstrate that Jaeger can output competitive results on multiple tasks.

2. Related Work

Large language models are pretrained on vast amounts of text data and exhibit a comprehensive understanding of natural language. Existing studies have introduced advanced training techniques to improve their ability to generalize across diverse tasks. Johnson et al. propose adaptive training techniques for large language models along with various architectural improvements, and further leverage the potential of transfer learning, showing significant results on downstream tasks such as text classification and sentiment analysis. These advancements pave the way for the extensive application of large language models in various domains, including machine translation (Vaswani et al., 2017), chatbots (Zhang et al., 2018), text summarization (Liu and Lapata, 2019), code generation (Chen et al., 2018), medical diagnosis from textual data, and more. The ubiquity and versatility of such models underline their importance in the current AI landscape.

3. Methodology

3.1. Problem Definition

Hierarchical relationship understanding is one of the key challenges in PDF VQA tasks. The primary objective is to improve comprehension at the document level, focusing on two question types that ask for the parent or child elements of a queried item. That is, all contents hierarchically related to the queried item in the question are expected to be identified (Ding et al., 2023).
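
To make the hierarchical retrieval concrete, the following is a minimal, purely illustrative Python sketch of parent/child lookup over a hypothetical document layout tree; the node structure and identifiers are illustrative and do not reflect the PDF-VQA annotation format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class LayoutNode:
    """A hypothetical layout element (section, figure, caption, ...)."""
    element_id: str
    text: str
    children: List["LayoutNode"] = field(default_factory=list)

def collect_descendants(node: LayoutNode) -> List[str]:
    """Return the ids of all elements hierarchically below `node`."""
    result = []
    for child in node.children:
        result.append(child.element_id)
        result.extend(collect_descendants(child))
    return result

# Example: a child-relationship question such as "Which elements belong to
# the Methodology section?" amounts to returning every descendant of the
# queried node.
section = LayoutNode("sec3", "Methodology", [
    LayoutNode("sec3.1", "Problem Definition"),
    LayoutNode("sec3.2", "Framework", [LayoutNode("fig1", "Model Architecture")]),
])
print(collect_descendants(section))  # ['sec3.1', 'sec3.2', 'fig1']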

3.2. Framework

Figure 1. Model architecture. All encoders are pretrained models (i.e., pretrained RoBERTa-large, pretrained GPT2-xl, pretrained BERT, and pretrained ResNet-101). Qfeat is the result of concatenating Qfeat_1 and Qfeat_2, which are the last hidden states from the corresponding encoders.

Our framework, Jaeger, is dedicated to acquiring distinct and valuable representations through the use of pretrained models. It encompasses the extraction of both textual and visual features, as shown in Figure 1. Textual features are obtained from two sources: (1) questions and (2) the primary content, while visual features are derived from region-of-interest areas within each page. For questions, we leverage two pretrained large language models (i.e., RoBERTa-large and GPT2-xl) to extract features, concatenating their outputs to capture information from different aspects effectively. For PDFs, after tokenization and region-of-interest identification, we obtain content text with positional information. We then utilize pretrained BERT and ResNet-101 to extract textual and visual features, respectively. This results in three distinct sets of features, each serving as a representation of a specific aspect of the problem.
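
As a minimal sketch of the question-feature branch described above, the snippet below assumes Hugging Face Transformers checkpoints ("roberta-large", "gpt2-xl"); the mean pooling over the last hidden states and the 512-dimensional reduction size are illustrative assumptions rather than fixed design details of Jaeger.

import torch
from transformers import AutoTokenizer, AutoModel

# Question encoders: RoBERTa-large (hidden size 1024) and GPT2-xl (hidden size 1600).
roberta_tok = AutoTokenizer.from_pretrained("roberta-large")
roberta = AutoModel.from_pretrained("roberta-large")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2-xl")
gpt2_tok.pad_token = gpt2_tok.eos_token  # GPT-2 defines no pad token; only needed for batched padding
gpt2 = AutoModel.from_pretrained("gpt2-xl")

def question_features(question: str) -> torch.Tensor:
    """Concatenate pooled last hidden states from both question encoders."""
    with torch.no_grad():
        r_in = roberta_tok(question, return_tensors="pt")
        q1 = roberta(**r_in).last_hidden_state.mean(dim=1)   # (1, 1024)
        g_in = gpt2_tok(question, return_tensors="pt")
        q2 = gpt2(**g_in).last_hidden_state.mean(dim=1)      # (1, 1600)
    return torch.cat([q1, q2], dim=-1)                       # (1, 2624)

# Dimensionality reduction applied after concatenation (512 is an assumed size).
reduce_proj = torch.nn.Linear(2624, 512)
q_feat = reduce_proj(question_features("Which figures belong to Section 3?"))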

4. Experiments

4.1. Dataset and Experimental Settings

We employ the publicly available PDF-VQA dataset (Ding et al., 2023), which includes a collection of question-answer pairs with specified question types, textual content, and positional information, and represents each PDF page as an image. Our evaluation metric is Exact Matching Accuracy (EMA), which counts a prediction as correct only if it matches all ground-truth answers for a given question. We use the bert-base-uncased tokenizer and the SGD optimizer with a learning rate of 1e-06.
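
The EMA metric can be made precise with the small sketch below: a prediction counts as correct only when the predicted answer set equals the full set of ground-truth answers for that question. The dictionary-based data layout is illustrative.

from typing import Dict, List

def exact_matching_accuracy(predictions: Dict[str, List[str]],
                            ground_truth: Dict[str, List[str]]) -> float:
    """Fraction of questions whose predicted answer set equals the gold set."""
    correct = sum(
        1 for qid, gold in ground_truth.items()
        if set(predictions.get(qid, [])) == set(gold)
    )
    return correct / max(len(ground_truth), 1)

# Example: a partial match on q1 does not count toward accuracy.
gold = {"q1": ["fig1", "tab2"], "q2": ["sec3.1"]}
pred = {"q1": ["fig1"], "q2": ["sec3.1"]}
print(exact_matching_accuracy(pred, gold))  # 0.5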

4.2. Baseline and Performance Comparison

We compare our model with three large visual-and-language pretrained models (VLPMs), namely VisualBERT, ViLT, and LXMERT, alongside the state-of-the-art method (LoSpa) on the PDF-VQA challenge. The distinguishing factor of Jaeger lies in its question feature processing step; all other methods process questions as sequences of question tokens encoded by pretrained BERT models (Ding et al., 2023). As shown in Table 1, Jaeger achieves the highest performance on both the validation and test sets, outperforming all baseline models.

Model                           Val     Test
VisualBERT (Li et al., 2019)    21.55   18.52
ViLT (Kim et al., 2021)         10.21    9.87
LXMERT (Tan and Bansal, 2019)   16.37   14.41
LoSpa (Ding et al., 2023)       30.21   28.99
Jaeger (ours)                   35.87   33.63
Table 1. Performance comparison using the EMA metric (%) on the validation and test sets.

5. Conclusion and Future Work

In this paper, we have introduced Jaeger, a novel model for document-based VQA tasks that leverages the robust feature extraction capabilities of large language models. Through a carefully designed concatenation strategy, the model achieves a new state of the art. Our work underscores the potential of combining state-of-the-art language models to address the challenges of document-based VQA. For future work, we will explore relational feature extraction methods; we believe that enhancing the model’s structural understanding at the document level will facilitate the identification of hierarchically related items, leading to a higher EMA. We will also examine how many large language models should be involved in feature extraction. Furthermore, we plan to investigate new fine-tuning strategies tailored to our specific task, aiming to further boost performance while maintaining efficiency.

References

  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision. 2425–2433.
  • Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849 (2018).
  • Cheng et al. (2023) Daixuan Cheng, Shaohan Huang, and Furu Wei. 2023. Adapting Large Language Models via Reading Comprehension. arXiv preprint arXiv:2309.09530 (2023).
  • Ding et al. (2023) Yihao Ding, Siwen Luo, Hyunsuk Chung, and Soyeon Caren Han. 2023. PDF-VQA: A New Dataset for Real-World VQA on PDF Documents. In Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, Gianmarco De Francisci Morales, Claudia Perlich, Natali Ruchansky, Nicolas Kourtellis, Elena Baralis, and Francesco Bonchi (Eds.). Springer Nature Switzerland, Cham, 585–601.
  • Kim et al. (2021) Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 5583–5594.
  • Li et al. (2020) Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 11336–11344.
  • Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019).
  • Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345 (2019).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Tan and Bansal (2019) Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Yang et al. (2016) Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 21–29.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243 (2018).