
DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond

Cong Yao
Alibaba DAMO Academy
Beijing, China
Correspondence to: [email protected]
Abstract

In this report, we introduce DocXChain, a powerful open-source toolchain for document parsing, which is designed and developed to automatically convert the rich information embodied in unstructured documents, such as text, tables and charts, into structured representations that are readable and manipulable by machines. Specifically, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. Upon these basic capabilities, we also build a set of fully functional pipelines for document parsing, i.e., general text reading, table parsing, and document structurization, to drive various applications related to documents in real-world scenarios. Moreover, DocXChain is concise, modularized and flexible, such that it can be readily integrated with existing tools, libraries or models (such as LangChain and ChatGPT), to construct more powerful systems that can accomplish more complicated and challenging tasks. The code of DocXChain is publicly available at: https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain

1 Introduction

“Make Every Unstructured Document Literally Accessible to Machines”

– The DocXChain Development Team, 2023

Documents are ubiquitous (in this project, we adopt a broad concept of documents, meaning that DocXChain can support various kinds of documents, including regular documents such as books, academic papers and business forms, as well as street view photos, presentations and even screenshots), since they are excellent carriers for recording and spreading information across space and time. Documents have been playing a critically important role in the daily work, study and life of people all over the world. Every day, billions of documents in different forms are created, viewed, processed, transmitted and stored around the world, either physically or digitally. However, not all documents in the digital world can be directly accessed by machines (including computers and other automated equipment), as only a portion of the documents can be successfully parsed with low-level procedures. For instance, the Adobe Extract APIs are able to directly convert the metadata of born-digital PDF files into HTML-like trees [10], but they fail completely on PDFs generated from scanned or camera-captured images. Therefore, if one would like to make documents that are not born-digital conveniently and instantly accessible to machines, a powerful toolset for extracting the structures and contents from such unstructured documents [12, 5, 3] is of the essence.

In this article, we introduce a new open-source toolchain for document parsing, called DocXChain, which is dedicated to converting unstructured documents into structured representations. Concretely, DocXChain provides tools to precisely detect layouts, read text and extract tables of documents, and to arrange these elements in an organized manner, such that the rich and precious information embodied in various unstructured documents, previously inaccessible to machines, is unlocked, enabling a broad range of document-related applications.

DocXChain is unique and powerful in that: (1) It assembles a collection of industry-leading algorithmic models for text detection, text recognition, table structure recognition and layout analysis, which are open-sourced by our team and publicly available on ModelScope (https://github.com/modelscope/modelscope) and AdvancedLiterateMachinery (https://github.com/AlibabaResearch/AdvancedLiterateMachinery); (2) Different from existing open-source libraries for OCR and document parsing, the tools in DocXChain can effectively handle documents from real-world scenarios, in addition to those collected for pure academic purposes; (3) DocXChain works out-of-the-box and is compatible with other tools or models (e.g., LangChain [4] and ChatGPT [6]), since it is concise and modularized.

2 Design and Implementation of DocXChain

In this section, we will describe in detail the design and implementation of DocXChain.

2.1 Core Ideology

The core design ideas of DocXChain are three-fold:

  • Object: The central objects of DocXChain are documents, rather than LLMs.

  • Concision: The capabilities for document parsing are presented in a simple “modules + pipelines” fashion, while unnecessary abstraction and encapsulation are abandoned.

  • Compatibility: This toolchain can be used as a stand-alone procedure to structurize documents, while it can also be readily integrated with existing tools, libraries or models, such as LangChain [4], ChatGPT [6] and GPT-4 [7], to build more powerful systems that can solve more complicated and challenging tasks (a minimal integration sketch follows this list).
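To make the compatibility idea concrete, the sketch below wraps the plain-text output of a parsing pipeline as LangChain documents, so that downstream chains (retrievers, QA chains, etc.) can consume it directly. The helper name parse_document is hypothetical (a stand-in for an actual DocXChain pipeline call), and the imports follow the 2023-era LangChain API.

```python
# Hedged sketch: bridge parser output into LangChain.
# `parse_document` is hypothetical; swap in an actual DocXChain pipeline call.
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def to_langchain_documents(parsed_text: str, source: str) -> list:
    """Chunk parsed document text and wrap each chunk as a LangChain Document."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return [
        Document(page_content=chunk, metadata={"source": source})
        for chunk in splitter.split_text(parsed_text)
    ]

# Usage (assuming `parse_document` returns the document's full text):
# docs = to_langchain_documents(parse_document("report.pdf"), "report.pdf")
```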

2.2 System Overview

Figure 1: System overview of DocXChain.

The overview of DocXChain is illustrated in Fig. 1. DocXChain provides atomic capabilities as well as fully functional pipelines, which are built upon PyTorch [9], TensorFlow [1], ModelScope [2] and other third-party libraries (such as libraries for loading images and PDFs).

In general, DocXChain, as a middle-level tool set, can be adopted to support high-level applications related to documents, such as document format conversion (e.g., pdf2word and image2word), DocQA, summarization, search and translation [3].

2.3 Modules and Pipelines

Module                       | Function Description
File Loading                 | Load document files. Only images (.jpg and .png) and PDFs (.pdf) are currently supported.
Text Detection               | Detect all text instances (those virtually machine-identifiable).
Text Recognition             | Recognize each text instance (assuming text detection has been performed in advance).
Layout Analysis              | Identify and categorize all layout regions (those virtually machine-identifiable).
Table Structure Recognition  | Recognize the structure of the given table. At present, only tables with visible borders are supported.
Table 1: Function description of the modules in DocXChain.

The detailed descriptions of the basic modules in DocXChain are depicted in Tab. 1. Each basic module realizes an atomic capability. DocXChain accepts image and PDF files as input (PDF pages are converted to images before subsequent processing; by default, only the first page is parsed if the input PDF has multiple pages). Currently, the supported languages are Chinese and English.
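As a concrete illustration of the file-loading behavior described above, the following sketch renders only the first page of a (possibly multi-page) PDF to an image before further processing. DocXChain's actual loader may be implemented differently; this version uses PyMuPDF (the fitz package) as one plausible choice.

```python
# Hedged sketch: emulate the documented file-loading behavior with PyMuPDF.
# DocXChain's real loader may differ; this is one reasonable implementation.
import fitz  # PyMuPDF
import numpy as np

def load_first_page_as_image(path: str, dpi: int = 200) -> np.ndarray:
    """Render the first page of a PDF to an RGB(A) numpy array."""
    with fitz.open(path) as doc:
        pix = doc[0].get_pixmap(dpi=dpi)  # rasterize page 0 only
        image = np.frombuffer(pix.samples, dtype=np.uint8)
        return image.reshape(pix.height, pix.width, pix.n)

# image = load_first_page_as_image("sample.pdf")
```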

Pipeline                 | Function Description
General Text Reading     | Detect and recognize all text instances (those virtually machine-identifiable).
Table Parsing            | Perform table parsing (table structure recognition + textual content recognition).
Document Structurization | Structurize the given document (layout analysis + text detection and recognition).
Table 2: Function description of the pipelines in DocXChain.

The detailed descriptions of the pipelines in DocXChain are shown in Tab. 2. These typical pipelines are built from the basic modules of DocXChain. For example, the General Text Reading pipeline consists of the Text Detection module and the Text Recognition module. Naturally, one can compose further pipelines from DocXChain's modules and other tools or libraries to meet different requirements; a hedged sketch of such a composition is given below.
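The sketch below assembles a custom general-text-reading routine from two ModelScope OCR models, in the spirit of the "modules + pipelines" design. The specific model identifiers and the output keys ('polygons', 'text') are assumptions based on publicly listed ModelScope OCR models and should be verified against the model hub before use.

```python
# Hedged sketch: a custom "general text reading" pipeline assembled from two
# ModelScope OCR models. Model IDs and output keys are assumptions.
import cv2
import numpy as np
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

detector = pipeline(Tasks.ocr_detection,
                    model='damo/cv_resnet18_ocr-detection-line-level_damo')
recognizer = pipeline(Tasks.ocr_recognition,
                      model='damo/cv_convnextTiny_ocr-recognition-general_damo')

def general_text_reading(image_path):
    """Detect all text instances, then recognize each one (detection + recognition)."""
    image = cv2.imread(image_path)
    results = []
    for poly in detector(image)['polygons']:  # one quadrangle (8 numbers) per text line
        pts = np.array(poly, dtype=np.int32).reshape(4, 2)
        x, y, w, h = cv2.boundingRect(pts)    # crude axis-aligned crop of the quadrangle
        text = recognizer(image[y:y + h, x:x + w])['text']
        results.append((pts.tolist(), text))
    return results

for quad, text in general_text_reading('document.jpg'):
    print(quad, text)
```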

2.4 Qualitative Examples

We also evaluate DocXChain on a small set of documents from real-world scenarios. As shown in Figs. 2, 3 and 4, DocXChain is able to successfully handle documents from different scenarios that are quite common in reality.

Specifically, it can read subway transfer information on a signboard (Fig. 2); it is also able to extract the structure and textual contents of a table containing detailed product specifications (Fig. 3); for documents with complex layout and dense text, it is capable of comprehensively parsing and organizing all the key elements (Fig. 4). In brief, the wide adaptability and high flexibility of DocXChain make it an excellent choice for powering various real-world applications.

Figure 2: General text reading example. The text detections are represented with orange quadrangles, while the text contents are listed on the right panel.
Figure 3: Table parsing example. The original image is shown on the left, while the table cells (in green) and text detections (in orange) are depicted on the right. For clarity, the recognized text contents are not overlaid on the image, but listed in the box below.
Figure 4: Document structurization example. Different colors are used to illustrate the categories of different layout regions. The text detections are represented with orange quadrangles. For clarity, the recognized text contents are skipped.

3 Conclusion and Outlook

In this article, we have introduced DocXChain, an open-source toolchain for document parsing. It releases algorithmic models and engineering code to support basic capabilities as well as typical pipelines, which can be used to extract the structures and contents of unstructured documents.

We also note that the newly released GPT-4V(ision) [8] is capable of reading text from images, understanding charts and reasoning over tables. However, GPT-4V(ision) is not an open-source system, and further quantitative investigations are needed to validate its accuracy and robustness in challenging scenarios [11]. Therefore, DocXChain, as a lightweight, open-source specialist toolchain for precise document parsing, is highly complementary to such generalists when analyzing and understanding documents in real-world applications.

DocXChain is designed and developed with the original aspiration of promoting the level of digitization and structurization for documents. In the future, we will go beyond pure document parsing capabilities, to explore more possibilities, e.g., combining DocXChain with large language models (LLMs) to perform document information extraction (IE), question answering (QA) and retrieval-augmented generation (RAG).
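As a glimpse of this direction, the sketch below feeds the plain-text output of document parsing to an LLM for question answering. The parsed_text input would come from a DocXChain-style structurization pipeline (not shown), and the call follows the 2023-era openai-python chat API; model choice and prompt wording are assumptions.

```python
# Hedged sketch: document QA on top of parsed text.
# `parsed_text` would come from a DocXChain-style structurization pipeline.
import openai  # 2023-era openai-python API (v0.x)

def answer_from_document(parsed_text: str, question: str) -> str:
    """Ask an LLM to answer a question using only the parsed document text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer strictly based on the provided document text."},
            {"role": "user",
             "content": f"Document:\n{parsed_text}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]
```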

References

  • Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Z. Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation, 2016.
  • [2] Alibaba DAMO Academy. ModelScope. https://github.com/modelscope/modelscope. Accessed: 2023-10-10.
  • Cui et al. [2021] Lei Cui, Yiheng Xu, Tengchao Lv, and Furu Wei. Document AI: Benchmarks, Models and Applications. ArXiv, abs/2111.08609, 2021.
  • [4] LangChainAI. LangChain. https://github.com/langchain-ai/langchain. Accessed: 2023-09-27.
  • Long et al. [2021] Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era. International Journal of Computer Vision, 129:161-184, 2021.
  • OpenAI [a] OpenAI. ChatGPT. https://openai.com/chatgpt, a. Accessed: 2023-09-27.
  • OpenAI [b] OpenAI. GPT-4. https://openai.com/gpt-4, b. Accessed: 2023-09-27.
  • OpenAI [c] OpenAI. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, c. Accessed: 2023-10-09.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Neural Information Processing Systems, 2019.
  • Saad-Falcon et al. [2023] Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Ryan A. Rossi, and Franck Dernoncourt. PDFTriage: Question Answering over Long, Structured Documents. ArXiv, abs/2309.08872, 2023.
  • Yang et al. [2023] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). ArXiv, abs/2309.17421, 2023.
  • Zhu et al. [2015] Yingying Zhu, Cong Yao, and Xiang Bai. Scene text detection and recognition: recent advances and future trends. Frontiers of Computer Science, 10:19-36, 2015.