Proposal Report for the 2nd SciCAP Competition 2024
Abstract
In this paper, we propose a method for document summarization using auxiliary information. This approach effectively summarizes descriptions related to specific images, tables, and appendices within lengthy texts. Our experiments demonstrate that leveraging high-quality OCR data and information initially extracted from the original text enables efficient summarization of the content related to the described objects. Based on these findings, we enhanced popular text generation models by incorporating additional auxiliary branches to improve summarization performance. Our method achieved top scores of 4.33 and 4.66 in the long caption and short caption tracks, respectively, of the 2024 SciCap competition, ranking first in both categories.
1 Introduction
Tables, images, and other objects within a paper can help readers quickly grasp its main ideas. Therefore, extracting descriptive information related to these objects from the text has garnered increasing attention Yang et al. (2021a); Yang et al. (2021b). With the development of the Transformer architecture Vaswani et al. (2017), many Transformer-based methods have been applied to text generation, efficiently performing tasks such as text segmentation, summarization, and description generation Lüddecke and Ecker (2022); Yang et al. (2019b). Extracting descriptive information for related objects has also gained significant attention Yang et al. (2022a). Compared to summarizing the document itself, summarizing descriptions related to objects is more challenging: the task requires attention not only to the text's context but also to the relevance between the objects and certain parts of the text. However, current methods often treat this task as a simple text generation problem.
Mainstream text generation methods include generating summaries from plain text and using multimodal techniques to align text and image information to generate summaries Wan et al. (2024a); Yang et al. (2022b). For the former, a common approach is to perform basic OCR on the objects or use models like CLIP Radford et al. (2021) to generate object descriptions. Most of these methods focus only on the information contained in the original or related text, without fully utilizing the information about the objects themselves. For multimodal methods, large language models such as LLaMA or GPT Wan et al. (2024b) and multimodal models such as OFA Yang et al. (2023) and BLIP-2 Li et al. (2023a) are widely used. While these models achieve high accuracy, they have limitations and can serve as auxiliary tools in the model fusion stage. Furthermore, previous methods typically rely on a single inference pass to obtain results, which can be biased: lengthy descriptions often contain substantial background information that acts as noise and affects the model's performance Fu et al. (2024).
To address these shortcomings, we propose our own summarization model. To address the incomplete utilization of information, we use PaddleOCR, an open-source OCR toolkit from the PaddlePaddle ecosystem, to obtain high-quality OCR information about the objects. For the inference process, we deviate from single-pass inference. We design a filtering method that divides paragraphs into multiple chunks, allowing the model to identify the chunks that are useful for generating the correct answer. After an initial extraction from the lengthy text, we perform a second inference pass on the descriptions related to the objects. This segmented inference approach allows the model to focus separately on information with strong and weak relevance to the objects, reducing the impact of irrelevant text on the results.
Our contributions can be summarized as follows:
1. We utilize information stored within the objects themselves, which is often overlooked, as additional input for generating descriptions. This allows the model to focus more on the descriptions related to the relevant objects rather than on the surrounding text alone.
2. Using a two-stage training and inference process, our model minimizes attention to text weakly related to the objects while reducing the noise interference from background information.
2 Related Work
The task of generating descriptions based on object or text information has seen significant development Yang et al. (2024a); Li et al. (2024); Yang et al. (2024b). Current mainstream methods are broadly categorized into pure text generation and multi-modal generation.
Text Generation. This line of work focuses primarily on extensive descriptive information about the objects, and models excel at extracting accurate descriptions from textual information. For non-textual information, such as images and tables, models like CLIP Radford et al. (2021) are often used to generate a rough description of the object, or OCR technology is employed to recognize the characters and lines inherent to the object. These pieces of information are then fed into the model along with the text for supervised learning. While this approach is easy to implement, it often fails to yield satisfactory results due to the limited learning capacity of text generation models. Additionally, the object information and the textual information are not well integrated.
Multi-Modal Generation. Multi-modal generation uses two modalities at the input stage: one directly encodes the object, and the other encodes the text information Yang et al. (2019a); Li et al. (2023b). Thanks to the rapid advancement of models such as LLaMA and GPT Wan et al. (2024b), alignment issues between modalities have become negligible. Consequently, large multi-modal models generally produce excellent results. However, due to their enormous parameter counts and technical complexity, making technical improvements to these models is not straightforward.
Overall, improving text generation methods is both practical and efficient. We therefore enhance a text generation model to fully utilize the known information, enabling it to understand and efficiently extract all relevant information. Detailed descriptions of these methods can be found in Section 3.

3 Methodology
3.1 Overall Architecture
Figure 1 illustrates the overall architecture of our solution, which consists of three components: preparing the training data (Section 3.2), generating summaries (Section 3.3), and model ensembling (Section 3.4). We describe each of these components below.
3.2 Prepare training data
OCR information extraction from images aims to convert the textual information contained within an image into text format, enabling the generation of more accurate and detailed image descriptions and thereby enhancing understanding of the image content. However, we found that the OCR information in the images provided by the official dataset is not accurate, as illustrated in Figure 2, making it difficult to generate precise image descriptions. To address this issue, we use PaddleOCR to re-extract OCR information from the images. This step ensures that we obtain more accurate and reliable textual data to support the subsequent summary generation task.
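To illustrate this step, the following is a minimal sketch of re-extracting OCR text from a figure image with PaddleOCR; the image path, confidence threshold, and result handling are illustrative assumptions rather than the exact competition pipeline.

```python
# Sketch: re-extract OCR text from a figure image with PaddleOCR.
# The result format can differ slightly across PaddleOCR versions.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # angle classifier helps with rotated axis labels

def extract_ocr_text(image_path: str, min_score: float = 0.5) -> str:
    """Run PaddleOCR on one figure image and join the recognized lines."""
    result = ocr.ocr(image_path, cls=True)
    lines = []
    for page in result:                      # one entry per image
        for _box, (text, score) in page:
            if score >= min_score:           # drop low-confidence fragments (assumed threshold)
                lines.append(text)
    return " ".join(lines)

print(extract_ocr_text("figures/example_figure.png"))  # hypothetical path
```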

Filtering gold sentences in paragraphs aims to remove redundant information from the paragraphs, producing more refined and effective training data for model training. Specifically, we first concatenate the provided paragraphs into a single complete paragraph, then use the SpacyTextSplitter tool to divide this paragraph into several chunks, represented by $C = \{c_1, c_2, \dots, c_n\}$. Afterwards, we construct two types of input sequences: one is the mention $m$ combined with a chunk $c_i$, and the other consists of the mention $m$ alone, which can be represented as $[m; c_i]$ and $[m]$ respectively. We then measure the difference in the probability that the model assigns to the given output $y$ (the gold caption) under the two inputs, which can be represented as $s_i = \log P(y \mid m, c_i) - \log P(y \mid m)$. Furthermore, to filter out irrelevant sentences, we set a threshold $\tau$. Chunks whose scores fall below this threshold are filtered out, so the set of chunks that ultimately make up the paragraph is $C' = \{c_i \mid s_i \geq \tau\}$; we consider the chunks in this set to be beneficial for generating the final summary. Finally, we concatenate all the chunks in $C'$ to form the final paragraph, denoted as $P'$.
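A minimal sketch of this chunk-scoring idea is shown below, assuming a Hugging Face seq2seq model; the specific checkpoint, the log-likelihood score, and the threshold value are illustrative assumptions rather than the exact setup used in the competition.

```python
# Sketch: score each chunk by how much it raises the likelihood of the gold caption,
# then keep only chunks whose score clears a threshold tau.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")  # small stand-in checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").eval()

def neg_log_likelihood(source: str, target: str) -> float:
    """Average negative log-likelihood of `target` given `source`."""
    inputs = tok(source, return_tensors="pt", truncation=True)
    labels = tok(target, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        return model(**inputs, labels=labels).loss.item()

def filter_chunks(mention: str, chunks: list[str], caption: str, tau: float = 0.0) -> list[str]:
    base = neg_log_likelihood(mention, caption)              # caption likelihood from the mention alone
    kept = []
    for chunk in chunks:
        with_chunk = neg_log_likelihood(mention + " " + chunk, caption)
        score = base - with_chunk                            # s_i > 0 means the chunk helps the caption
        if score >= tau:
            kept.append(chunk)
    return kept
```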
However, this filtering relies on knowing the correct answer: during the training phase the gold captions are available, so we can filter irrelevant information from the paragraphs using the method above, but during the testing phase the correct answers are unknown. To filter irrelevant information from the test data, we therefore train a flan-t5-xl model. When constructing its training data, the input consists of the prompt "Filter out the irrelevant information in the text:" concatenated with the original paragraph $P$, and the target output is the filtered paragraph $P'$ obtained through the method above; the training process can be represented as $P' = f_\theta(\text{prompt}, P)$, where $f_\theta$ denotes the flan-t5-xl model. In the testing phase, given a paragraph and the prompt, we use the model to predict the filtered text, $\hat{P}' = f_\theta(\text{prompt}, P)$, which is then used in the summary generation phase. Through this method, we retain the paragraph descriptions that are beneficial for generating image captions and reduce the model's overhead during the summary generation stage.
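The test-time filtering step can be sketched as follows; the fine-tuned checkpoint path, input length limit, and decoding settings are assumptions.

```python
# Sketch: apply the trained flan-t5-xl filter to an unseen paragraph.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("checkpoints/filter-flan-t5-xl")  # hypothetical fine-tuned weights

def filter_paragraph(paragraph: str) -> str:
    prompt = "Filter out the irrelevant information in the text: " + paragraph
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=2048)
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=512)
    return tok.decode(output_ids[0], skip_special_tokens=True)
```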
Construction of prompts aims to ensure clarity of instruction and consistency of data input across different model architectures, facilitating effective training and evaluation. We further develop the training dataset used in the summary generation phase, designing two prompt templates for the Pegasus Zhang et al. (2020), Pegasus-X-Large, and LLaMA2-13B models. These are structured as follows:
Prompt 1: "OCR Text: Mentions: Paragraphs:"
Prompt 2: "Summarize the following OCR text, mentions, and paragraphs, extracting key information and generating a concise summary. OCR Text: Mentions: Paragraphs:"
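For concreteness, a small sketch of filling these templates per sample is given below; the field names (ocr_text, mentions, paragraphs) are hypothetical and stand for the OCR output, the figure mentions, and the filtered paragraph.

```python
# Sketch: fill the two prompt templates from a per-sample record.
PROMPT_1 = "OCR Text: {ocr_text} Mentions: {mentions} Paragraphs: {paragraphs}"
PROMPT_2 = (
    "Summarize the following OCR text, mentions, and paragraphs, extracting key "
    "information and generating a concise summary. "
    "OCR Text: {ocr_text} Mentions: {mentions} Paragraphs: {paragraphs}"
)

def build_prompt(sample: dict, template: str = PROMPT_1) -> str:
    return template.format(
        ocr_text=sample["ocr_text"],      # PaddleOCR output for the figure
        mentions=sample["mentions"],      # sentences that mention the figure
        paragraphs=sample["paragraphs"],  # filtered paragraph text
    )

example = {"ocr_text": "Accuracy vs. epochs", "mentions": "Figure 3 shows ...", "paragraphs": "..."}
print(build_prompt(example, PROMPT_2))
```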
3.3 Generate summaries
Through the steps outlined in the previous section, we obtain the text description corresponding to each image. During the training phase, the input to the generative model is the constructed description $D$, and the ground truth is the actual caption $y$ corresponding to the image, used as the output; this can be formalized as $\hat{y} = G(D)$, where $G$ denotes the generative model trained so that $\hat{y}$ matches $y$. In the inference phase, we input the constructed text description into the model, enabling it to generate a summary for the given text.
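A minimal inference sketch for this stage is shown below, using a Pegasus-X checkpoint from Hugging Face; the fine-tuned weights path, input length, and beam settings are assumptions.

```python
# Sketch: generate a caption/summary from the constructed description D.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/pegasus-x-large")
model = AutoModelForSeq2SeqLM.from_pretrained("checkpoints/pegasus-x-large-scicap")  # hypothetical fine-tuned weights

def generate_caption(description: str) -> str:
    inputs = tok(description, return_tensors="pt", truncation=True, max_length=2048)
    ids = model.generate(**inputs, num_beams=8, max_new_tokens=128)
    return tok.decode(ids[0], skip_special_tokens=True)
```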
Table 1: Results on the short caption track.

Method | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-1-Normalized | ROUGE-2-Normalized
---|---|---|---|---|---
Base | 0.13 | 0.40 | 0.25 | 2.38 | 3.81
+PaddleOCR | 0.11 | 0.37 | 0.28 | 2.48 | 4.02
+LLaMA | 0.16 | 0.43 | 0.27 | 2.56 | 4.05
+Filter | 0.08 | 0.40 | 0.24 | 2.23 | 4.11
Combine | 0.06 | 0.39 | 0.24 | 2.37 | 4.66
3.4 Model-ensemble
Model ensembling is the final stage. Specifically, we train the Pegasus, Pegasus-X-Large, and LLaMA2-13B models and obtain their weights at different epochs. For Pegasus and Pegasus-X-Large, we use the weights from the 4th, 5th, and 6th epochs for inference, and each weight generates 16 candidate summaries. For the LLaMA2-13B model, we use the weights from the 3rd, 4th, 5th, and 6th epochs for inference, with each weight generating one candidate summary. Thus, for each image we obtain 100 candidate summaries, from which we select the one that best describes the image. We sequentially treat each of the 100 candidates as if it were the correct summary; its score is the average of its ROUGE-2-normalized scores against the other candidates, which can be formulated as
$S(c_i) = \frac{1}{|C| - 1} \sum_{c_j \in C,\, j \neq i} R_{2n}(c_i, c_j)$,
where $S(c_i)$ is the averaged score of candidate $c_i$ against all the other captions, $R_{2n}(c_i, c_j)$ is the ROUGE-2-normalized score between two captions $c_i$ and $c_j$, and $C$ represents the set of all candidate summaries. Ultimately, we select the summary with the highest score as the caption for the image.
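The selection rule above can be sketched directly with a ROUGE implementation; here ROUGE-2 F1 from the rouge_score package stands in for the competition's ROUGE-2-normalized score, which is an assumption.

```python
# Sketch: pick the candidate with the highest average ROUGE-2 score against the others.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def rouge2(a: str, b: str) -> float:
    return scorer.score(a, b)["rouge2"].fmeasure

def select_best(candidates: list[str]) -> str:
    def avg_score(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(rouge2(candidates[i], c) for c in others) / len(others)
    return candidates[max(range(len(candidates)), key=avg_score)]

print(select_best([
    "accuracy of the model over training epochs",
    "model accuracy over training epochs",
    "an unrelated sentence about datasets",
]))
```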
4 Experiment
4.1 Dataset
The dataset used in this study is the official dataset provided, which contains approximately 500,000 data samples. Each sample includes a description of an object, a paragraph of original text about the object (without preliminary extraction), and several sentences related to the object (extracted from the paragraph). About 400,000 samples were used for training, and 47,639 samples were used for testing.
4.2 Details
We used two A100 GPUs to train the base models, including Pegasus and Pegasus-X-Large, among others. Training was conducted for between 3 and 10 epochs, with a learning rate of 1e-5.
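A sketch of this fine-tuning setup is given below (learning rate 1e-5, per-epoch checkpoints for the later ensemble); the tiny inline dataset, sequence lengths, batch size, and output path are placeholders rather than the real training configuration.

```python
# Sketch: fine-tune a Pegasus-X model on prompt/caption pairs with the Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/pegasus-x-large"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder examples standing in for the constructed prompt/caption pairs.
raw = Dataset.from_dict({
    "text": ["OCR Text: ... Mentions: ... Paragraphs: ..."],
    "caption": ["Example figure caption."],
})

def preprocess(batch):
    inputs = tok(batch["text"], truncation=True, max_length=2048)
    inputs["labels"] = tok(text_target=batch["caption"], truncation=True, max_length=128)["input_ids"]
    return inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=["text", "caption"])

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints/pegasus-x-large-scicap",  # hypothetical path
    learning_rate=1e-5,
    num_train_epochs=6,              # the paper trains for between 3 and 10 epochs
    per_device_train_batch_size=2,
    save_strategy="epoch",           # keep per-epoch weights for the ensemble stage
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```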
4.3 Results
Our experimental results for the short caption track are presented in Table 1. BLEU (Bilingual Evaluation Understudy) evaluates the quality of generated text against one or more references by measuring n-gram overlap; we report BLEU-4. ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) calculates the overlap of unigrams (single words) between the generated text and the reference text, focusing on recall. ROUGE-2 extends this by calculating the overlap of bigrams (two-word sequences), providing a more detailed measure of similarity. ROUGE-1-Normalized adjusts the ROUGE-1 score by the length of the reference text, giving a normalized recall value that accounts for differences in text length, and ROUGE-2-Normalized does the same for the bigram overlap score, ensuring comparability across texts of varying lengths.
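For reference, the unnormalized metrics can be computed with common libraries as sketched below (sacrebleu for BLEU-4, rouge_score for ROUGE-1/2); the length-normalized variants used by the competition are not reproduced here, and the example strings are made up.

```python
# Sketch: compute BLEU-4 and ROUGE-1/2 for a single prediction-reference pair.
import sacrebleu
from rouge_score import rouge_scorer

prediction = "accuracy of the proposed model on the test set"
reference = "test accuracy of the proposed model"

bleu = sacrebleu.corpus_bleu([prediction], [[reference]])  # BLEU-4 by default
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

print(f"BLEU-4:  {bleu.score:.2f}")
print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.2f}")
print(f"ROUGE-2: {rouge['rouge2'].fmeasure:.2f}")
```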
5 Conclusion
This paper outlines the methods we employed in the 2nd SciCap Competition 2024. We improved upon standard text generation models by addressing their limitations in extracting information from lengthy text and their underutilization of object-specific information. Our approach maximizes the extraction of information from both the text and the objects themselves. Under this method, standard text generation models outperformed large language models in summarizing descriptive text.
References
- Fu et al. [2024] Zhongtian Fu, Kefei Song, Luping Zhou, and Yang Yang. Noise-aware image captioning with progressively exploring mismatched words. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12091–12099, 2024.
- Li et al. [2023a] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
- Li et al. [2023b] Xin-Chun Li, Yang Yang, and De-Chuan Zhan. Mrtf: model refinery for transductive federated learning. volume 37, pages 2046–2069. Springer, 2023.
- Li et al. [2024] Xin-Chun Li, Shaoming Song, Yinchuan Li, Bingshuai Li, Yunfeng Shao, Yang Yang, and De-Chuan Zhan. Map: Model aggregation and personalization in federated learning with incomplete classes. IEEE, 2024.
- Lüddecke and Ecker [2022] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wan et al. [2024a] Fengqiang Wan, Xiangyu Wu, Zhihao Guan, and Yang Yang. Covlr: Coordinating cross-modal consistency and intra-modal relations for vision-language retrieval. In ICME, 2024.
- Wan et al. [2024b] Fengqiang Wan, Xiangyu Wu, Zhihao Guan, and Yang Yang. Covlr: Coordinating cross-modal consistency and intra-modal relations for vision-language retrieval. 2024.
- Yang et al. [2019a] Yang Yang, Zhao-Yang Fu, De-Chuan Zhan, Zhi-Bin Liu, and Yuan Jiang. Semi-supervised multi-modal multi-instance multi-label deep network with optimal transport. IEEE Transactions on Knowledge and Data Engineering, 33(2):696–709, 2019.
- Yang et al. [2019b] Yang Yang, De-Chuan Zhan, Yi-Feng Wu, Zhi-Bin Liu, Hui Xiong, and Yuan Jiang. Semi-supervised multi-modal clustering and classification with incomplete modalities. IEEE Transactions on Knowledge and Data Engineering, 33(2):682–695, 2019.
- Yang et al. [2021a] Yang Yang, Zhen-Qiang Sun, Hengshu Zhu, Yanjie Fu, Yuanchun Zhou, Hui Xiong, and Jian Yang. Learning adaptive embedding considering incremental class. volume 35, pages 2736–2749. IEEE, 2021.
- Yang et al. [2021b] Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, and Jian Yang. Rethinking label-wise cross-modal retrieval from a semantic sharing perspective. In IJCAI, pages 3300–3306, 2021.
- Yang et al. [2022a] Yang Yang, Hongchen Wei, Hengshu Zhu, Dianhai Yu, Hui Xiong, and Jian Yang. Exploiting cross-modal prediction and relation consistency for semisupervised image captioning. IEEE Transactions on Cybernetics, 54(2):890–902, 2022.
- Yang et al. [2022b] Yang Yang, Jingshuai Zhang, Fan Gao, Xiaoru Gao, and Hengshu Zhu. Domfn: A divergence-orientated multi-modal fusion network for resume assessment. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1612–1620, 2022.
- Yang et al. [2023] Yang Yang, Chubing Zhang, Xin Song, Zheng Dong, Hengshu Zhu, and Wenjie Li. Contextualized knowledge graph embedding for explainable talent training course recommendation. volume 42, pages 1–27. ACM New York, NY, USA, 2023.
- Yang et al. [2024a] Yang Yang, Jinyi Guo, Guangyu Li, Lanyu Li, Wenjie Li, and Jian Yang. Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning. volume 18, page 181335. Springer, 2024.
- Yang et al. [2024b] Yang Yang, Nan Jiang, Yi Xu, and De-Chuan Zhan. Robust semi-supervised learning by wisely leveraging open-set data. IEEE, 2024.
- Zhang et al. [2020] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International conference on machine learning, pages 11328–11339. PMLR, 2020.