SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
Abstract.
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc’s data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline’s capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.
1. Introduction
Visual Document Understanding (VDU) is a complex endeavor that seeks to decipher and interpret information from documents across a spectrum of formats and layouts (Harley et al., 2015; Jaume et al., 2019; Pfitzmann et al., 2022; Wang et al., 2021; Li et al., 2020). The objective of VDU is to develop algorithms capable of grasping the content, structure, and context of documents, thereby enabling tasks such as document classification (Harley et al., 2015), text detection (Jaume et al., 2019; Park et al., 2019), layout analysis (Pfitzmann et al., 2022; Wang et al., 2021), and object detection (Li et al., 2020, 2019).
Current research in VDU predominantly employs two methodologies: one (Xu et al., 2020a, b; Huang et al., 2022; Liao et al., 2023; Bai et al., 2022; Liu et al., 2023c) relies on OCR technology to convert document images into text for subsequent processing, while the other (Appalaraju et al., 2023; Kim1 et al., 2022; Lv et al., 2023; Mohamed et al., 2023; Lee et al., 2023; Liu et al., 2023a; Blecher et al., 2023) adopts an end-to-end approach, analyzing the document images directly. The pre-training and fine-tuning paradigm is extensively utilized in multimodal learning (Tang et al., 2024; Guo et al., 2024; Kim1 et al., 2022; Liu et al., 2023b; Blecher et al., 2023; Lee et al., 2023). The end-to-end approach leverages this paradigm to incorporate robust text recognition capabilities into the model, addressing the limitations of OCR accuracy and achieving high processing efficiency. A common pre-training task is the text reading task, and previous studies (Lee et al., 2023; Kim1 et al., 2022) have demonstrated its efficacy in enhancing model performance across various downstream tasks, such as document parsing and document Visual Question Answering (VQA). Therefore, leveraging the text reading task to bolster the capabilities of the base model is of paramount importance.
The data requirements for the text reading task encompass two main aspects: high-quality document images and corresponding text annotations that reflect the reading order. Obtaining such data is intricate, with existing methods either depending on large-scale public document datasets and additional OCR models to generate pseudo-labels (Kim1 et al., 2022) or relying on complex data processing pipelines to scrape document data from the web (Weber et al., 2024). However, these methods often result in low-quality labels, face copyright restrictions, and contend with data noise. Moreover, they typically focus only on specific elements within document images, such as text or certain document components. For example, Nougat (Blecher et al., 2023) and KOSMOS-2.5 (Lv et al., 2023) concentrate on table parsing, while MatCha (Liu et al., 2023a) emphasizes chart rendering. It is rare to find a dataset that encompasses all document elements simultaneously. A recent approach, Vary (Wei et al., 2023a), renders various document types but uses only about ten templates, which falls short in terms of the richness of document layouts.
To tackle the limitations in document layout richness and the challenges associated with data acquisition, we introduce SynthDoc, a synthetic document generation pipeline. This pipeline is designed to create datasets that include text, images, tables, and a variety of charts. We begin by aggregating publicly available datasets, which have been validated on large language or multimodal models, to form our text and image corpora. We then extend TableGeneration (WenmuZhou and SWHL, [n. d.]) to produce a diverse set of tables, and use tools such as pandas (pandas development team, 2020), Matplotlib (Hunter, 2007), and ECharts (Li et al., 2018) to generate chart-table pairs, thus expanding our chart data corpus. Our approach provides three distinct advantages: 1) SynthDoc can leverage abundant, openly available NLP datasets to generate high-resolution, coherent content for multimodal model training. 2) SynthDoc is efficient and precise, dynamically customizes document layouts, and scales readily. 3) The synthesized data include comprehensive content and structural annotations, facilitating the pre-training of LLM-based structured document parsing models. Synthetic data can effectively complement expensive, manually labeled real datasets.
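As a rough illustration of the chart-table pairing described above, the following Python sketch renders a bar chart with Matplotlib and serializes the same data as an HTML table with pandas; the data values and file names are illustrative assumptions, not the released pipeline code.

```python
# Minimal sketch of a chart-table pair: a bar chart image plus an HTML table
# holding the same key-value data as its structural annotation.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# Hypothetical key-value data sampled from a text corpus.
df = pd.DataFrame({"category": ["A", "B", "C", "D"], "value": [12, 7, 21, 15]})

# Render the chart image.
fig, ax = plt.subplots(figsize=(4, 3), dpi=150)
ax.bar(df["category"], df["value"])
ax.set_title("Sampled values")
fig.savefig("chart.png", bbox_inches="tight")
plt.close(fig)

# Paired annotation: the same data serialized as an HTML table.
with open("chart.html", "w", encoding="utf-8") as f:
    f.write(df.to_html(index=False))
```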
Our comprehensive experiments, leveraging the Donut model, have yielded compelling results that underscore the efficacy of the SynthDoc pipeline. The models trained with our synthesized document images have achieved remarkable accuracy in the pre-training read task, demonstrating a keen ability to parse both Chinese and English text, as well as tables and charts within the generated datasets. This proficiency extends to the fine-tuning phase of downstream tasks, where the models maintain a high level of performance despite the primary and secondary tasks involving languages that are not always consistent.
Furthermore, we have conducted visual analyses of the models’ parsing capabilities on more complex, real-world documents. Despite the relatively limited variety of document types synthesized by our pipeline, the models have shown commendable results in parsing these intricate documents. A particularly surprising finding pertains to the chart parsing capabilities. In instances where scatter plots did not explicitly label the x-axis, our models were able to accurately infer the horizontal coordinates. This suggests that the models trained with our rendered data possess a certain level of spatial understanding and an awareness of the sequence among numerical values.
In response to the absence of comprehensive public datasets for model validation in document image parsing, we have released a set of 5,000 images based on the SynthDoc pipeline. This release not only showcases the quality and diversity of the document data we generate but also provides a benchmark for the document image recognition community to advance and develop new methodologies.
In summary, the key contributions of this paper are as follows:
• SynthDoc Pipeline: We introduce a novel synthetic data pipeline for document images, named SynthDoc, which utilizes publicly available text or text-image pairs along with rendered tables and charts. This pipeline is capable of simultaneously generating text, images, tables, and various types of charts within document images.
• Benchmark Release: We have made available to the research community a benchmark dataset consisting of 5,000 image-text pairs. This release aims to highlight the robustness of the data produced by our pipeline and to support further research and development in the area of document image parsing.
• Experimental Validation: Through experiments based on the Donut model, we have demonstrated that our proposed dataset and training methodology lead to a significant enhancement in the model's document image parsing capabilities. Additionally, the models trained with our approach maintain competitive performance across a range of downstream tasks.

2. Related Work
2.1. Image Document Data
Deep learning-based document image understanding has consistently been recognized as a significant and intricate task, and many datasets have been proposed to parse and understand document images from different perspectives. For example, FUNSD (Jaume et al., 2019) is utilized for form understanding, RVL-CDIP (Harley et al., 2015) for document classification, and PubLayNet (Zhong et al., 2019) for document layout analysis. However, these datasets fail to meet the requirements of recent end-to-end methods, which rely on large amounts of document image data for pre-training. Some approaches (Kim1 et al., 2022; Davis et al., 2022) parse existing document datasets, such as IIT-CDIP (Lewis et al., 2006), with commercial OCR models. However, the quality of datasets obtained in this way is constrained by OCR accuracy, and using commercial OCR models can be costly. Other approaches (Weber et al., 2024; Lv et al., 2023; Wang et al., 2021) rely on web crawling to collect extensive data from the internet, extracting document image data through parsing and filtering; owing to the complexity of real documents, this often yields datasets with considerable noise and is subject to copyright restrictions. Unlike these methods, we collect existing web-scale datasets (Wei et al., 2023b; Laurençon et al., 2024; Schuhmann et al., 2021, 2022; Gadre et al., 2024; Xue et al., 2020; Penedo et al., 2023) that have already been used by large language models or multimodal large language models, and employ a synthetic approach to obtain document image data, which yields clean data and can include complex elements such as charts.
2.2. Text Reading Task
As end-to-end multimodal models evolve, the task of text reading within document images has gained increasing attention from scholars, affirming its significant value in the field. For example, Donut (Kim1 et al., 2022) is pre-trained on document images and their associated text annotations, reading text from images token by token conditioned on the preceding text. Nougat (Blecher et al., 2023) follows Donut's model and training approach, with a specialized focus on scientific papers, adeptly reading text, tables, and formulas using a markup language. DocParser (Mohamed et al., 2023) introduces the Masked Document Reading method, which is designed to enhance the model's reasoning capabilities by predicting the text situated within masked regions. UReader (Ye et al., 2023) utilizes the text reading task to train a multimodal large language model, and proposes to predict text from any position in the document image, which ensures the model can read different parts of the text in context. Pix2struct (Lee et al., 2023) found that the text reading task shows a strong curriculum-learning effect: using it as a warmup phase results in more stable training, faster convergence, and better performance. It is worth noting that all of these tasks require millions of document images; KOSMOS-2.5 (Lv et al., 2023), for example, collected 324.4M samples from public datasets and the web, such as IIT-CDIP (Lewis et al., 2006), arXiv, and GitHub. However, such data are difficult to obtain and carry copyright restrictions, so we propose a data rendering pipeline for the text reading task to improve the model's understanding of bilingual documents.
2.3. Synthetic Document Image
Document image generation has received extensive attention in the field of visual document understanding. Some generation algorithms, based on GANs, produce plausible document images, emphasizing the diversity and quality of the generated documents. For example, Biswas et al. (Biswas et al., 2021) utilize a GAN model to generate diverse and credible document images from a provided layout. Zheng et al. (Zheng et al., 2019) proposed a deep layout generation model for graphic design that implicitly captures the influence of visual and textual content on layout and synthesizes complex layout designs according to the visual and textual semantics provided by users. However, these methods do not consider the annotation information required for visual document understanding, the quality and size of the generated images are limited by the model, and additional models must be trained for different languages, which is inefficient. Other methods generate document and ground-truth pairs for specific visual document understanding tasks. For document layout analysis, Pisaneschi et al. (Pisaneschi and et al., 2023) generate document layouts based on LayoutTransformer (Gupta et al., 2021) and additional post-processing, and then fill in the corresponding texts, images, and mathematical objects from a model or a collected corpus. Ling et al. (Ling et al., 2021) proposed a document domain randomization approach that simulates document layouts and then randomly fills in collected elements such as texts and images. For the pre-training of document intelligence tasks, Biten et al. (Biten et al., 2022) generate large-scale pre-training data with OCR annotations on the IDL dataset using commercial OCR tools. However, the current pre-training of intelligent document understanding based on large language models relies on document image parsing tasks, and existing data can no longer meet the training demands. We therefore propose a new dataset generation pipeline to synthesize accurate, clear, and logically coherent document parsing datasets that keep pace with the development of visual document understanding.
A similar effort to this paper is Donut (Kim1 et al., 2022), which uses a portion of generated data to supplement data in different languages. The difference is that their work randomly pastes text into images, ignoring layout information and structured elements such as tables, charts, and images.
3. Document Image Synthesis
In this section, we delve into the pipeline for generating document images, which is primarily composed of two key components: layout design and content rendering, as shown in Fig. 1. The layout design encompasses the architectural planning at three distinct scales: the entire page, individual regions, and lines of text. This meticulous arrangement ensures that the document’s structure aligns with conventional reading habits while maximizing visual diversity. Content rendering, on the other hand, is responsible for the creation of both graphic and textual elements. This phase includes the rendering of graphics, which can consist of tables, images, and charts, as well as the rendering of text. Each element is crafted with attention to detail, ensuring that the final document image not only conveys information accurately but also presents it in an aesthetically pleasing and reader-friendly manner.
3.1. Layout Design
The document image synthesis pipeline comprises three integral components: the Page Controller, Region Controller, and Line Controller. The Page Controller ensures a consistent and visually appealing layout by defining and maintaining layout elements and typographical attributes. The Region Controller segments the document into distinct areas for various content types, facilitating a logical and balanced composition. Lastly, the Line Controller meticulously organizes text, applying typographical rules to enhance readability and engagement. Together, these components work to create structured, professional-looking documents that are both informative and aesthetically pleasing.
3.1.1. Page Controller.
The Page Controller is instrumental in establishing a consistent and visually appealing layout for single-page documents. It sets and maintains the uniformity of layout elements such as data areas, page margins, and the spacing between segments and lines. Additionally, it oversees the font size and color palette, ensuring that the document’s visual presentation is coherent and reader-friendly. This component’s role is critical in creating a structured and professional look that enhances the document’s overall readability and impact.
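A minimal sketch of the page-level attributes the Page Controller might fix per document is shown below; the field names and value ranges are illustrative assumptions rather than the pipeline's actual schema.

```python
# Sketch of page-level layout attributes: data area, margins, spacing,
# font size, and color palette, randomized within reader-friendly bounds.
from dataclasses import dataclass, field
from typing import List, Tuple
import random

@dataclass
class PageConfig:
    page_size: Tuple[int, int] = (1280, 960)                # (height, width) in pixels
    margins: Tuple[int, int, int, int] = (60, 60, 60, 60)   # top, right, bottom, left
    segment_spacing: int = 24                                # gap between regions
    line_height: float = 1.4                                 # relative to font size
    font_size: int = 18
    palette: List[str] = field(default_factory=lambda: ["#000000", "#333333"])

def sample_page_config() -> PageConfig:
    """Randomize typographic attributes while keeping the page readable."""
    return PageConfig(
        margins=tuple(random.randint(40, 90) for _ in range(4)),
        segment_spacing=random.randint(16, 40),
        font_size=random.choice([14, 16, 18, 20]),
    )
```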
3.1.2. Region Controller.
The Region Controller plays a pivotal role in the document’s structural integrity by meticulously segmenting the data areas into distinct regions for text, images, tables, and charts. It operates on a macro level, determining where each type of content will be placed to optimize readability and visual impact. This controller ensures that the document’s layout supports a logical flow, with areas designated for complex data representations such as charts and tables, and separate sections for textual content. By carefully allocating space for each element, the Region Controller ensures that the document’s overall composition is balanced and adheres to the principles of good document design, allowing readers to navigate the information with ease.
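The sketch below illustrates one plausible region-allocation strategy consistent with this description, splitting the data area top-to-bottom into typed regions; the function and its interface are hypothetical.

```python
# Sketch of region allocation: split the page's data area into typed regions
# (text / image / table / chart) with spacing between them.
import random
from typing import Dict, List, Tuple

def allocate_regions(data_area: Tuple[int, int, int, int],
                     spacing: int = 24) -> List[Dict]:
    x0, y0, x1, y1 = data_area
    kinds = ["text", random.choice(["image", "table", "chart"]), "text"]
    usable = (y1 - y0) - spacing * (len(kinds) - 1)
    # Random proportions for each region, normalized to the usable height.
    weights = [random.uniform(0.5, 1.5) for _ in kinds]
    total = sum(weights)
    regions, y = [], y0
    for kind, w in zip(kinds, weights):
        h = int(usable * w / total)
        regions.append({"type": kind, "bbox": (x0, y, x1, y + h)})
        y += h + spacing
    return regions
```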
3.1.3. Line Controller.
The Line Controller is responsible for the micro-level organization of textual content within the document. It takes the individual word images produced by the Text Renderer and arranges them into coherent lines, respecting the predefined attributes such as word spacing, line height, and alignment. This controller’s work is crucial for establishing the document’s typographical style, which includes setting the rhythm and pacing of the text. By fine-tuning the line breaks, indentations, and other typographical elements, the Line Controller ensures that the text is not only legible but also visually engaging. This attention to detail in formatting contributes to a professional and polished appearance, enhancing the document’s overall presentation quality.
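A simplified sketch of such line assembly follows: pre-rendered word images are greedily packed into lines under a width budget and pasted onto a canvas with fixed word spacing and line height. The API is an illustrative assumption, not the pipeline's implementation.

```python
# Sketch of line assembly: place word images onto lines, respecting word
# spacing, line height, and the available region width.
from typing import List
from PIL import Image

def compose_lines(word_images: List[Image.Image], region_width: int,
                  word_spacing: int = 8, line_height: int = 32) -> Image.Image:
    lines, current, width = [], [], 0
    for w in word_images:
        advance = w.width + (word_spacing if current else 0)
        if current and width + advance > region_width:
            lines.append(current)        # line is full, start a new one
            current, width = [], 0
            advance = w.width
        current.append(w)
        width += advance
    if current:
        lines.append(current)

    canvas = Image.new("RGB", (region_width, line_height * max(len(lines), 1)), "white")
    for row, line in enumerate(lines):
        x = 0
        for w in line:
            canvas.paste(w, (x, row * line_height))
            x += w.width + word_spacing
    return canvas
```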
3.2. Content Rendering
With the layout meticulously established, the pipeline transitions to the content rendering phase, where the visual and textual elements of the document come to life. This stage involves the intricate process of integrating graphics and text, ensuring that each component not only complements the layout but also enhances the document’s overall narrative and aesthetic appeal.

3.2.1. Graphic Renderer.
The Graphic Renderer is a sophisticated component of our pipeline, dedicated to the rendering of images, tables, and charts. For images, we focus on incorporating natural images, where available category data is used to caption and embed the images within the document. If category information is present, the returned text represents the category; otherwise, it is replaced with a generic placeholder "<nature_image>". This approach ensures that each image is contextually relevant and enhances the document's informational content.
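The labeling rule above can be summarized by the following snippet; only the placeholder tag is taken from the text, while the helper function itself is illustrative.

```python
# Sketch of the image-labeling rule: use the category as the caption when it
# is available, otherwise fall back to the generic placeholder tag.
from typing import Optional

def image_label(category: Optional[str]) -> str:
    return category if category else "<nature_image>"

# e.g. image_label("dog") -> "dog"; image_label(None) -> "<nature_image>"
```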
In the realm of tables, we have designed two distinct types to accommodate various data presentations. The first type features complete borders, suitable for complex data with line breaks within cells, while the second type adopts a minimalist or borderless style, aligning with the prevalent aesthetic in research publications. Both types incorporate random cell merging to manage data complexity effectively. The rendered tables are displayed in Fig. 2.
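The snippet below sketches table synthesis with random cell merging by emitting HTML with occasional colspan attributes; it mirrors the behavior described here but is not the TableGeneration code itself.

```python
# Sketch of table synthesis: emit an HTML table (bordered or borderless) in
# which some horizontally adjacent cells are randomly merged via colspan.
import random
from typing import List

def render_table(rows: List[List[str]], bordered: bool = True,
                 merge_prob: float = 0.15) -> str:
    style = ' border="1"' if bordered else ""
    html = [f"<table{style}>"]
    for row in rows:
        html.append("<tr>")
        col = 0
        while col < len(row):
            span = 1
            # Randomly merge the current cell with its right neighbor.
            if col + 1 < len(row) and random.random() < merge_prob:
                span = 2
            attr = f' colspan="{span}"' if span > 1 else ""
            html.append(f"<td{attr}>{row[col]}</td>")
            col += span
        html.append("</tr>")
    html.append("</table>")
    return "\n".join(html)
```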
For charts, our pipeline supports the rendering of four chart types: bar, pie, line, and scatter plots. Bar charts, available in both horizontal and vertical orientations, are crafted for data comparison, with key-value pairs represented in a tabular format to facilitate readability. To mitigate issues with overlapping labels in vertical bar charts, we implement random fonts and rotation angles. Pie charts, similar in rendering to bar charts, require that the aggregated values represent a total of 1 or 100, expressed as decimals or percentages. Line charts illustrate trends over time or variables, with each chart featuring a unique set of data groups and points, generating an image-label pair. Scatter plots, used to depict the distribution of a single element, employ a label and x and y coordinates for each point, with the number of points limited to a range of [5, 20] to manage complexity. The generated examples are depicted in Fig. 3a. The corresponding HTML annotations are displayed in Fig. 3b.
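As an illustration of the scatter-plot constraints above (5 to 20 points, each annotated with its coordinates), the following Matplotlib sketch generates an image together with its textual label; value ranges and file names are assumptions.

```python
# Sketch of scatter-plot rendering: sample 5-20 points, plot them, and keep
# the (x, y) pairs as the chart's textual annotation.
import random
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def render_scatter(path: str = "scatter.png"):
    n = random.randint(5, 20)
    points = [(round(random.uniform(0, 2000), 2), round(random.uniform(0, 500), 2))
              for _ in range(n)]
    fig, ax = plt.subplots(figsize=(4, 3), dpi=150)
    ax.scatter([p[0] for p in points], [p[1] for p in points])
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    # Textual label: one (x, y) pair per point.
    return [{"x": x, "y": y} for x, y in points]
```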
The emphasis on the model’s ability to understand the structure of diverse elements is paramount. We refrain from using AI tools to generate data within elements, instead leveraging an open textual corpus for our tables and charts, ensuring the authenticity and relevance of the data. The matplotlib library is utilized for chart rendering, and we have refined table rendering techniques to better integrate with the document’s overall design.
Table 1. Document parsing results on the proposed benchmark (AED: lower is better; F1-score, Precision, Recall: higher is better).

| Metric | Method | Pure Doc (English) | Pure Doc (Chinese) | Doc w/ image | Doc w/ table | Doc w/ chart | Average |
|---|---|---|---|---|---|---|---|
| AED | Donut (Kim1 et al., 2022) | 0.3764 | 0.5148 | 0.7631 | 0.8679 | 0.9097 | 0.6864 |
| AED | Vary (Wei et al., 2023a) | 0.1452 | 0.1760 | 0.5598 | 0.7415 | 0.6663 | 0.4578 |
| AED | Ours | 0.0321 | 0.1370 | 0.1665 | 0.0583 | 0.1029 | 0.0994 |
| F1-score | Donut (Kim1 et al., 2022) | 0.9370 | 0.8107 | 0.3720 | 0.4573 | 0.2840 | 0.5722 |
| F1-score | Vary (Wei et al., 2023a) | 0.8554 | 0.9002 | 0.5852 | 0.5854 | 0.6531 | 0.7159 |
| F1-score | Ours | 0.9611 | 0.9020 | 0.8855 | 0.9199 | 0.8810 | 0.9099 |
| Precision | Donut (Kim1 et al., 2022) | 0.9534 | 0.8256 | 0.4061 | 0.5302 | 0.4063 | 0.6243 |
| Precision | Vary (Wei et al., 2023a) | 0.8762 | 0.8974 | 0.6383 | 0.7026 | 0.7961 | 0.7821 |
| Precision | Ours | 0.9717 | 0.9136 | 0.9065 | 0.9347 | 0.9017 | 0.9256 |
| Recall | Donut (Kim1 et al., 2022) | 0.9228 | 0.8015 | 0.3647 | 0.4313 | 0.2540 | 0.5549 |
| Recall | Vary (Wei et al., 2023a) | 0.8482 | 0.9044 | 0.5746 | 0.5501 | 0.5868 | 0.6928 |
| Recall | Ours | 0.9515 | 0.8916 | 0.8682 | 0.9076 | 0.8636 | 0.8965 |
3.2.2. Text Renderer.
The Text Renderer plays an indispensable role in the content rendering process, meticulously generating word images for each word in the text. This method affords a high level of control over the typography and layout, ensuring that the text is not only legible but also aesthetically integrated with the document’s visual elements. The Text Renderer works in concert with the Graphic Renderer to weave a cohesive and engaging narrative, blending visual and textual information to enhance the reader’s experience.
Following Donut’s data generation approach, the Text Renderer creates a word image for each word, which is crucial for the document’s visual composition and label generation. This attention to detail in text rendering ensures that the document’s textual content is as carefully crafted as its visual elements, contributing to a polished and professional final product.
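A minimal per-word rendering sketch with Pillow is shown below; the font choice (Pillow's default bitmap font) and the tight-crop logic are illustrative assumptions.

```python
# Sketch of per-word rendering: draw each word onto its own tightly cropped
# image, which the Line Controller then assembles into lines.
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, font=None, color: str = "black",
                padding: int = 2) -> Image.Image:
    font = font or ImageFont.load_default()
    # Measure the word with a throwaway canvas, then draw it at the final size.
    probe = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    x0, y0, x1, y1 = probe.textbbox((0, 0), word, font=font)
    img = Image.new("RGB", (x1 - x0 + 2 * padding, y1 - y0 + 2 * padding), "white")
    ImageDraw.Draw(img).text((padding - x0, padding - y0), word, fill=color, font=font)
    return img
```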
Table 2. Results on the CORD dataset.

| Model | OCR | Acc | Precision | Recall | F1 |
|---|---|---|---|---|---|
| BERT (Hwang et al., 2019) | ✓ | 78.2 | - | - | 82.2 |
| BROS (Hong et al., 2022) | ✓ | 80.3 | - | - | 83.7 |
| LayoutLMv2 (Xu et al., 2020b) | ✓ | 87.0 | - | - | 88.9 |
| KOSMOS-2.5 (Lv et al., 2023) | - | - | 83.64 | 87.83 | 85.69 |
| Donut (Kim1 et al., 2022) | - | 93.5 | - | - | 91.6 |
| Ours | - | 90.1 | 82.6 | 83.3 | 82.9 |
3.3. Concerns of the Data Generation Pipeline
3.3.1. Scalability.
Even if we generate as much diverse data as possible, it can hardly cover all real-world document layouts. To mitigate this, we have integrated real document images into our benchmark to maximize layout variability. It is worth noting, however, that our solution is highly adaptable, with scalability along two key dimensions: 1) Layout customization: tailored document layouts allow us to swiftly and cost-effectively expand the training data to fit various scenarios. 2) Language independence: our pipeline transcends language barriers, enabling document image generation in any language; for instance, we have produced French documents using the ROOTS (Laurençon et al., 2022) dataset.
3.3.2. Data Privacy.
Our pipeline allows for local regulatory adaptation and reproducibility of datasets through customizable pipeline components. We advocate for the use of public corpora and tools to foster transparency and verifiability in research.
4. Training on SynthDoc
This section details the pre-training of the model based on the Donut architecture, focusing on its parsing performance with bilingual (English and Chinese) documents. The primary objective is to validate the model’s ability to effectively handle and interpret content in both languages, ensuring its suitability for multilingual document analysis.
4.1. Model Architecture
Unlike previous OCR-based approaches (Huang et al., 2022; Bai et al., 2022) to visual document understanding, recent research (Lee et al., 2023; Mohamed et al., 2023) has shifted towards parsing document images in an end-to-end fashion, eliminating the need for OCR results as input. The dataset we generate primarily aims to enhance and validate the visual document parsing capabilities of these end-to-end models. As illustrated in Figure 4, our model is built on the Donut architecture. Following Donut (Kim1 et al., 2022), we utilize the Swin Transformer (Liu et al., 2021) as the visual encoder; previous experiments have demonstrated its superior performance compared to ViT (Dosovitskiy et al., 2020). We employ mBART (Lewis et al., 2019) as the decoder, which offers stronger noise robustness and multilingual capability.
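One plausible way to assemble this architecture with the Hugging Face Transformers library is sketched below, starting from the public naver-clova-ix/donut-base checkpoint (a Swin encoder paired with a four-layer mBART-style decoder), following the publicly available Donut fine-tuning recipe; the exact initialization used in our experiments may differ.

```python
# Sketch: load a Donut-style vision encoder-decoder and set the input size.
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

image_size = [1280, 960]                      # (height, width) used in this paper
model.config.encoder.image_size = image_size
processor.image_processor.size = image_size[::-1]  # processor expects (width, height)
```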
4.2. Implementation Details
Following the previous works (Blecher et al., 2023; Kim1 et al., 2022), we employ Swin-Base as the encoder and the first four layers of mBART as the decoder, with a patch size of 4 and a window size of 10. We set the input image size to (H, W) = (1280, 960) to meet the requirements of Swin-Base for image dimensions. For pre-training, we set a batch size of 192 and employ the AdamW optimizer, initializing the learning rate at 5e-5 and setting a minimum of 7.6e-6, while utilizing an exponential scheduler with a gamma of 0.9996, updating the learning rate every 16 training steps. For fine-tuning, we utilize a cosine scheduler with a learning rate of 3e-5 to optimize our model, dynamically adjusting the input size according to the datasets, a practice effectively demonstrated by Donut.
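A sketch of this pre-training schedule in PyTorch follows, reusing the `model` from the previous sketch; the minimum-learning-rate clamp is implemented manually, since ExponentialLR has no built-in floor, and the training loop itself is omitted.

```python
# Sketch of the pre-training optimization schedule: AdamW at 5e-5, decayed by
# an exponential scheduler (gamma 0.9996) every 16 steps, clamped at 7.6e-6.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9996)
MIN_LR = 7.6e-6

def step_lr(global_step: int) -> None:
    # Decay every 16 optimizer steps, never dropping below the minimum.
    if global_step % 16 == 0:
        scheduler.step()
        for group in optimizer.param_groups:
            group["lr"] = max(group["lr"], MIN_LR)
```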
4.2.1. Document Image Parsing.
We evaluate the document parsing capabilities of other end-to-end models on the benchmark proposed in this paper and compare them with the model we trained. As shown in Table 1, we evaluate the models on five types of documents: English documents, Chinese documents, documents with natural images, documents with tables, and documents with charts. All models exhibit strong performance on English and Chinese documents, except for Donut, which shows slightly inferior results on the Average Edit Distance (AED), possibly due to its lack of training on such document data. However, with the exception of our model, all models display inadequate performance on complex documents containing additional elements. Specifically, our model achieves 0.1665, 0.0583, and 0.1029 AED on document images with images, tables, and charts, respectively, reductions of 0.3933, 0.6832, and 0.5634 compared to Vary. It is noteworthy that in our benchmark, text labels associated with these other elements represent only a small portion of the annotation. This observation indicates that elements such as images in documents can significantly affect a model's text parsing capability.
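For reference, the sketch below shows one way to compute a normalized (average) edit distance and a token-level F1 between a prediction and its ground truth; the benchmark's exact normalization and tokenization choices may differ.

```python
# Sketch of the evaluation metrics: character-level normalized edit distance
# (lower is better) and whitespace-token precision/recall/F1 (higher is better).
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def aed(pred: str, gt: str) -> float:
    return edit_distance(pred, gt) / max(len(pred), len(gt), 1)

def token_prf(pred: str, gt: str):
    p, g = pred.split(), gt.split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    precision = common / len(p) if p else 0.0
    recall = common / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```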
4.2.2. Results on CORD.
The CORD dataset is a collection of data used for receipt recognition, comprising 800 samples for training and 100 samples for testing. Our pipeline's performance on the English CORD dataset did not demonstrate the expected improvements, largely due to the substantial distribution bias of our pre-training data towards Chinese; this can be addressed in subsequent research by enabling the model to handle English-language documents more adeptly. However, it is worth noting that our model not only improves its proficiency in Chinese document image recognition but also retains comparable performance on downstream tasks.
4.3. Visual Analysis
We provide extensive visualization results to demonstrate the model's strong performance in text image recognition. Specifically, Figure 5 illustrates synthetic images containing tables, images, and charts, demonstrating our model's ability to parse text, tables, images, and charts in a manner consistent with human reading order. Furthermore, as illustrated in the last row of Figure 6, our model exhibits robust parsing capability when applied to real document images.
4.3.1. Spatial Understanding.
We observed that end-to-end models possess strong spatial understanding capabilities. Specifically, we provided serialized numerical coordinates in scatter plots and line graphs, defining a new coordinate space, and our trained model can accurately localize points in this space. Figure 5c shows a document image with a scatter chart, and Figure 5f shows the model's prediction; we provided only the vertical coordinates of the points in the image, yet the model accurately infers their corresponding horizontal coordinates. For example, for a point with a vertical coordinate of 435.18, the model identifies its horizontal coordinate as 1096, which closely aligns with the provided ground truth.
4.3.2. Robustness to Interference.
Benefiting from training with documents containing natural images, our model is robust to visual interference. As shown in Figure 6, Figures 6a and 6c present real images captured by a camera, while Figures 6b and 6d illustrate the model's predictions. Although the model incorrectly identifies some challenging regions as natural images, this does not impede subsequent text parsing. This behavior has not been observed in other end-to-end methods. We believe that training with synthetic data incorporating varied contexts is an important way to improve model robustness and performance.
5. Limitation
While the current generation of documents through SynthDoc is a significant step forward, we acknowledge that the types of documents created thus far are somewhat limited in variety. To enhance the richness of our dataset and to better mimic the complexity of real-world documents, we are committed to expanding our pipeline’s capabilities. Future iterations will incorporate more sophisticated intermingling of document elements, allowing for the generation of even more intricate and varied document types. This evolution will not only challenge and refine existing models but also pave the way for the development of more advanced document image recognition systems, capable of handling the multifaceted nature of documents encountered in everyday applications.
6. Conclusion
In conclusion, this study presents SynthDoc, an innovative pipeline for generating synthetic documents, which plays a pivotal role in bolstering Visual Document Understanding (VDU). By producing a high-quality, diverse dataset that encompasses text, images, tables, and charts, SynthDoc addresses the critical issues of data acquisition and the constraints imposed by current datasets. Utilizing publicly accessible corpora and sophisticated rendering tools, SynthDoc has successfully created a dataset that is both extensive and adaptable. Our empirical evaluations, employing the Donut model, have shown that models trained on SynthDoc’s dataset not only excel in pre-training read tasks but also exhibit resilience in downstream tasks, even when faced with linguistic disparities. The introduction of a benchmark dataset featuring 5,000 image-text pairs not only highlights the capabilities of our pipeline but also serves as a substantial contribution to the VDU community, facilitating further research and development in the realm of document image recognition. This research marks a significant advancement in the field by providing a scalable approach to overcoming data scarcity and by empirically validating the effectiveness of end-to-end models in parsing intricate, real-world documents.
References
- Appalaraju et al. (2023) Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, and R. Manmatha. 2023. DocFormerv2: Local Features for Document Understanding. arXiv preprint arXiv:2306.01733 (2023).
- Bai et al. (2022) Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, and Qun Liu. 2022. Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding. arXiv preprint arXiv:2212.09621 (2022).
- Biswas et al. (2021) Sanket Biswas, Pau Riba, Josep Lladós, and Umapada Pal. 2021. DocSynth: a layout guided approach for controllable document image synthesis. In International Conference on Document Analysis and Recognition. 555–568.
- Biten et al. (2022) Ali Furkan Biten, Ruben Tito, Lluis Gomez, Ernest Valveny, and Dimosthenis Karatzas. 2022. Ocr-idl: Ocr annotations for industry document library dataset. In European Conference on Computer Vision. 241–252.
- Blecher et al. (2023) Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2023. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418 (2023).
- Davis et al. (2022) Brian Davis, Bryan Morse, Bryan Price, Chris Tensmeyer, Curtis Wigington, and Vlad Morariu. 2022. End-to-end document recognition and understanding with dessurt. In European Conference on Computer Vision. 280–296.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
- Gadre et al. (2024) Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, and Ludwig Schmidt. 2024. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems 36 (2024).
- Guo et al. (2024) Zhaojun Guo, Jinghui Lu, Xuejing Liu, Rui Zhao, ZhenXing Qian, and Fei Tan. 2024. What Makes Good Few-shot Examples for Vision-Language Models? arXiv preprint arXiv:2405.13532 (2024).
- Gupta et al. (2021) Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S Davis, Vijay Mahadevan, and Abhinav Shrivastava. 2021. Layouttransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1004–1014.
- Harley et al. (2015) Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR). 991–995.
- Hong et al. (2022) Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, and Sungrae Park. 2022. Bros: A pre-trained language model focusing on text and layout for better key information extraction from documents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 10767–10775.
- Huang et al. (2022) Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia. 4083–4091.
- Hunter (2007) John D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
- Hwang et al. (2019) Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, and Hwalsuk Lee. 2019. Post-OCR parsing: building simple and robust parser via BIO tagging. In Workshop on Document Intelligence at NeurIPS 2019.
- Jaume et al. (2019) Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. Funsd: A dataset for form understanding in noisy scanned documents. In International Conference on Document Analysis and Recognition Workshops, Vol. 2. 1–6.
- Kim1 et al. (2022) Geewook Kim1, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-free Document Understanding Transformer. In European Conference on Computer Vision. 498–517.
- Laurençon et al. (2024) Hugo Laurençon, Lucile Saulnier, Leo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. 2024. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems 36 (2024).
- Laurençon et al. (2022) Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems 35 (2022), 31809–31826.
- Lee et al. (2023) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning. 18893–18912.
- Lewis et al. (2006) David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, David Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. 665–666.
- Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
- Li et al. (2018) Deqing Li, Honghui Mei, Yi Shen, Shuang Su, Wenli Zhang, Junting Wang, Ming Zu, and Wei Chen. 2018. ECharts: a declarative framework for rapid construction of web-based visualization. Visual Informatics (2018), 136–146.
- Li et al. (2019) Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2019. Tablebank: A benchmark dataset for table detection and recognition. arXiv preprint arXiv:1903.01949 (2019).
- Li et al. (2020) Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics. 949–960.
- Liao et al. (2023) Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, and Vijay Mahadevan. 2023. Doctr: Document transformer for structured information extraction in documents. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19584–19594.
- Ling et al. (2021) Meng Ling, Jian Chen, Torsten Möller, Petra Isenberg, Tobias Isenberg, Michael Sedlmair, Robert S Laramee, Han-Wei Shen, Jian Wu, and C Lee Giles. 2021. Document domain randomization for deep learning document layout extraction. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I 16. 497–513.
- Liu et al. (2023a) Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. 2023a. MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering. In ACL. 12756–12770. https://doi.org/10.18653/v1/2023.acl-long.714
- Liu et al. (2023b) Xuejing Liu, Wei Tang, Jinghui Lu, Rui Zhao, Zhaojun Guo, and Fei Tan. 2023b. Deeply coupled cross-modal prompt learning. In Findings of the Association for Computational Linguistics: ACL 2023. 7957–7970.
- Liu et al. (2023c) Xuejing Liu, Wei Tang, Xinzhe Ni, Jinghui Lu, Rui Zhao, Zechao Li, and Fei Tan. 2023c. What Large Language Models Bring to Text-rich VQA? arXiv preprint arXiv:2311.07306 (2023).
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision. 10012–10022.
- Lv et al. (2023) Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, and Furu Wei. 2023. Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419 (2023).
- Mohamed et al. (2023) Dhouib Mohamed, Bettaieb Ghassen, and Shabou Aymen. 2023. DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents. In International Conference on Document Analysis and Recognition. 155–172.
- pandas development team (2020) The pandas development team. 2020. pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134
- Park et al. (2019) Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, and Minjoon Seo. 2019. CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.
- Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data only. Advances in Neural Information Processing Systems 36 (2023), 79155–79172.
- Pfitzmann et al. (2022) Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. 2022. Doclaynet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining. 3743–3751.
- Pisaneschi and et al. (2023) Lorenzo Pisaneschi et al. 2023. Automatic generation of scientific papers for data augmentation in document layout analysis. PRL (2023).
- Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35 (2022), 25278–25294.
- Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
- Tang et al. (2024) Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, and Zechao Li. 2024. Context disentangling and prototype inheriting for robust visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 5 (2024), 3213–3229.
- Wang et al. (2021) Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. 2021. Layoutreader: Pre-training of text and layout for reading order detection. arXiv preprint arXiv:2108.11591 (2021).
- Weber et al. (2024) Maurice Weber, Carlo Siebenschuh, Rory Butler, Anton Alexandrov, Valdemar Thanner, Georgios Tsolakis, Haris Jabbar, Ian Foster, Bo Li, Rick Stevens, and Ce Zhang. 2024. WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data. Advances in Neural Information Processing Systems 36 (2024).
- Wei et al. (2023a) Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2023a. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109 (2023).
- Wei et al. (2023b) Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, and Yahui Zhou. 2023b. Skywork: A More Open Bilingual Foundation Model. arXiv preprint arXiv:2310.19341 (2023).
- WenmuZhou and SWHL ([n. d.]) WenmuZhou and SWHL. [n. d.]. TableGeneration. GitHub repository. https://github.com/WenmuZhou/TableGeneration
- Xu et al. (2020a) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020a. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1192–1200.
- Xu et al. (2020b) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2020b. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740 (2020).
- Xue et al. (2020) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020).
- Ye et al. (2023) Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, , and Fei Huang. 2023. UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model. In Findings of the Association for Computational Linguistics: EMNLP 2023. 2841–2858.
- Zheng et al. (2019) Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. 2019. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG) 38 (2019), 1–15.
- Zhong et al. (2019) Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. PubLayNet: Largest Dataset Ever for Document Layout Analysis. In 2019 International conference on document analysis and recognition (ICDAR). 1015–1022. https://doi.org/10.1109/ICDAR.2019.00166