EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding
Abstract
Chart understanding enables automated data analysis for humans, which requires models to achieve highly accurate visual comprehension. While existing Visual Language Models (VLMs) have shown progress in chart understanding, the lack of high-quality training data and comprehensive evaluation benchmarks hinders VLM chart comprehension. In this paper, we introduce EvoChart, a novel self-training method for generating synthetic chart data to enhance VLMs’ capabilities in real-world chart comprehension. We also propose EvoChart-QA, a novel benchmark for measuring models’ chart comprehension abilities in real-world scenarios. Specifically, EvoChart is a unique self-training data synthesis approach that simultaneously produces a high-quality training corpus and a high-performance chart understanding model. EvoChart-QA consists of 650 distinct real-world charts collected from 140 different websites and 1,250 expert-curated questions that focus on chart understanding. Experimental results on various open-source and proprietary VLMs tested on EvoChart-QA demonstrate that even the best proprietary model, GPT-4o, achieves only 49.8% accuracy. Moreover, the EvoChart method significantly boosts the performance of open-source VLMs on real-world chart understanding tasks, achieving 54.2% accuracy on EvoChart-QA.
1 Introduction
Chart Question Answering (CQA) aims to answer specific questions based on the context provided by chart images, enabling automated data analysis such as business data reports. This process requires complex chart understanding and visual reasoning skills to interpret various elements, including visual components, text, and values. Consequently, CQA tasks have attracted the interest of researchers [10, 24, 20].
Recently, VLMs [18, 7, 15] have shown significant advancements in general visual capabilities, especially in chart understanding, achieving high scores on the ChartQA dataset [35, 4, 23]. However, their real-world performance is notably weaker than their ChartQA test set performance. To illustrate this, as shown in Figure 1, we discarded complex reasoning problems in the ChartQA [20] training set, posed 103 basic understanding questions, and evaluated various VLMs on them. The results, presented in the Appendix, show that performance dropped by over 40% compared to ChartQA scores, even for questions on training set charts. This highlights two points: first, current VLMs are capable of answering some chart-reasoning questions, but they lack a comprehensive understanding of charts. Second, the ChartQA dataset allows models to correctly answer questions without a comprehensive understanding of the charts, leading to an overestimation of the capabilities of current models [30].

Firstly, the lack of high-quality chart training data is a major reason why current models lack robust chart understanding capabilities. Existing methods [20, 19, 23] for collecting chart training data fall into two categories: manual annotation and automatic synthesis. Manually annotated data have real-world chart appearances but suffer from coarse granularity and high human cost. Automatically synthesized data offer fine-grained annotations but lack real-world diversity, leading to poor model robustness. Thus, constructing chart datasets involves a difficult trade-off between cost and quality, resulting in a scarcity of high-quality chart training data.
Secondly, the single source of charts and the excessive focus on high-level chart reasoning are the primary reasons why the ChartQA dataset provides an overly optimistic estimate of VLM chart understanding capabilities. The ChartQA dataset has only four chart sources, focused on politics and economics. Charts from each source share similar styles, making the dataset prone to overfitting. Additionally, datasets like ChartQA focus heavily on numerical and logical reasoning, which allows a model to answer questions correctly without a clear understanding of the chart. For example, given “What is the difference between Bad and Good in 2015?”, the model may not explicitly know the values of “Good” and “Bad” in 2015 but may still answer the question accurately.
To address these challenges, we propose a novel method, EvoChart, for synthesizing high-quality chart datasets with real-world characteristics. We also introduce EvoChart-QA, a carefully crafted benchmark for evaluating chart comprehension in real-world scenarios. EvoChart is a multi-stage self-training approach for chart data generation. In each stage, the chart generator produces a batch of synthetic chart data, and the model self-selects and refines the chart data, ensuring that the synthesized data is of high quality for the current stage. Subsequently, the model trains on the self-selected data to progress to the next stage. This approach produces both a progressively challenging dataset and a robust chart understanding model. EvoChart-QA is a benchmark designed for basic chart understanding, featuring 650 charts from 140 real-world websites and 1,250 expert-curated questions. The diverse chart styles accurately simulate real-world scenarios, with questions focused on chart understanding. Experiments demonstrate that the model trained with our EvoChart method achieves outstanding performance of 54.2% accuracy on EvoChart-QA and also exhibits a leading 81.5% on the ChartQA dataset.
Our main contributions are threefold:
• We propose EvoChart, a method that combines chart dataset construction with model self-training, using a multi-stage approach to simultaneously produce high-quality chart data and a chart understanding model.
• We propose a novel real-world chart basic understanding benchmark, EvoChart-QA, which comprehensively evaluates a model’s chart understanding capability through multi-source real-world charts and multiple types of manually curated questions.
• We conducted extensive experiments on the EvoChart method and EvoChart-QA. Results demonstrate that the EvoChart method significantly outperforms other data synthesis methods, and we provide an in-depth analysis of the performance of various VLMs on EvoChart-QA. We have made EvoChart and EvoChart-QA publicly available at https://github.com/MuyeHuang/EvoChart.
2 Related Work
2.1 Chart Question Answering Datasets
Since FigureQA [12] pioneered the CQA task, numerous datasets for chart question answering have emerged. Synthetic datasets, such as DVQA [10], PlotQA [24], RealCQA [2], ChartX [31], and UniChart [19], utilize synthetically generated charts or template-based questions. Generated datasets, such as ChartSFT [23], utilize a mixture of GPT-4 [25]-generated charts and questions. Mixed datasets include ChartQA [20] and CharXiv [30]: the former is compiled semi-manually with the assistance of templates, while the latter is template-based and requires evaluation by GPT-4o. In contrast, EvoChart-QA focuses on real-world scenarios and employs an automated evaluation method that does not require GPT-4.
2.2 Visual Language Models on CQA
VLMs are language models with visual understanding capabilities, and they have numerous applications in CQA tasks. Small VLMs like ChartReader [6], MatCha [16] and UniChart [19] have shown superior performance on tasks like PlotQA and DVQA, highlighting the potential of VLMs in CQA. ChartLlama [9] was a milestone, being the first to apply LLaVa1.5 [17] to CQA tasks and achieving impressive performance. Subsequently, works like ChartInstruct [21], ChartAst-D [23], and TinyChart [35] delved into the multimodal alignment and CQA reasoning aspects of VLMs in CQA, achieving remarkable performance. Recently, open-source general VLMs such as Phi3-Vision [1] and Intern-VL2.0 [5], through large-scale training, have achieved state-of-the-art performance on the ChartQA dataset.

2.3 Self-Training Approach
With the increasing capabilities of language models, numerous researchers have begun to explore the potential of leveraging language models for self-training. GPT3Mix [34] proved that large language model augmentation of textual corpora is very effective. Further, ReST [8] achieves cost-effective and efficient human preference alignment through a dual-loop self-training approach. Dennis et al. [28] obtained new data from the self-talk of multi-role-playing LLM Agents by adding a filtering check mechanism, realizing efficient self-training. Recently, Xu et al. [32] proposed ENVISIONS, which uses a neural-symbolic self-training approach to significantly improve mathematical and logical reasoning abilities without relying on external stronger models or evaluation tools. Inspired by their work, our proposed EvoChart focuses on a scalable self-training process.
3 EvoChart Method
We introduce EvoChart, a unique self-training data synthesis approach that simultaneously produces a high-quality training corpus and a high-performance chart understanding model. EvoChart comprises three iterative phases: Compositional Chart Generation for generating charts with diverse appearances, Chart Evaluation and Refinement to select and refine charts suitable for the current stage, and QA-pair Generation and Training to produce training data and provide a stronger model for the subsequent stage. These phases operate cyclically throughout the construction of EvoChart, as illustrated in Figure 2. We explore each of these steps in the following subsections.
3.1 Compositional Chart Generation
Compositional chart generation aims to produce high-quality and diverse charts at minimal cost, serving as the core component of EvoChart’s construction. Previous approaches [9, 23, 35] either rely solely on GPT-4 for chart generation, resulting in limited diversity and high cost, or use plotting libraries for random generation, which often leads to unrealistic themes and styles. To achieve continuous, low-cost, and diverse chart creation, we propose a two-step generation strategy within the compositional chart generation process:
1) Chart Code Seed Generation: This step serves as the initial phase of chart generation, aiming to produce fundamental chart code containing elements that are difficult to achieve via random processes. These elements include chart themes and appropriate units for the x- and y-axes. Notably, this step is executed only once during the entire chart dataset construction process. As [33] demonstrated the feasibility of large language models for code generation, we employ sophisticated prompt engineering techniques to guide GPT-4 in generating a large number of real-world chart code seeds. Our prompts for GPT-4 are based on the following key aspects:
Chart Types: Different chart types are suited for different themes. We focus on the four most prevalent real-world chart types: line charts, bar charts, pie charts, and scatter charts. For each of these chart types, we generate themes that are specifically tailored to their characteristics and use cases.
Chart Themes: Chart themes are challenging to generate through any random process, and manual curation is both costly and prone to domain bias. By prompting GPT-4, we have generated 25,000 themes across over 200 domains, including politics, economics, technology, and everyday life. These themes encompass various titles, units, and other relevant elements.
Chart Color Schemes: Chart color schemes significantly influence the visual appearance of a chart. We employ an automated approach to generate over 200 color palettes, ensuring aesthetically pleasing chart appearances. These color schemes include diverse colors for lines, bars, segments, and backgrounds, among other visual elements.
Through these cost-effective, one-time steps, we have generated 25,000 chart code seeds. These seeds exhibit diverse themes and color palettes, and retain hundreds of rich configuration options, forming the foundation for subsequent low-cost data generation.
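To make this step concrete, the sketch below shows how such one-time seed generation might be prompted. It assumes the openai Python client (v1.x); the prompt wording, function name, and JSON format are illustrative assumptions rather than the exact ones used by EvoChart.

```python
# Sketch: one-time generation of chart code seeds via GPT-4 prompting.
# Assumes the openai>=1.0 client and an OPENAI_API_KEY in the environment;
# the prompt wording and requested JSON schema are hypothetical.
from openai import OpenAI

client = OpenAI()
CHART_TYPES = ["line", "bar", "pie", "scatter"]

def generate_seed_themes(chart_type: str, n_themes: int = 20) -> str:
    """Ask GPT-4 for realistic themes (title, axis units, value range) for one chart type."""
    prompt = (
        f"Propose {n_themes} realistic themes for a {chart_type} chart. "
        "For each theme, give a title, x-axis unit, y-axis unit, and a plausible value range, "
        "covering domains such as politics, economics, technology, and everyday life. "
        "Return the result as a JSON list."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Seeds are generated once per chart type and reused throughout dataset construction.
seeds = {t: generate_seed_themes(t) for t in CHART_TYPES}
```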
2) Composable Chart Generator: The Composable Chart Generator is responsible for producing diverse charts. It is invoked multiple times during the construction process. This step automates the creation of a wide variety of charts by randomly assigning configurations to the chart code. We have defined over a hundred configuration options, each with dozens of potential values. The generator automatically selects these options based on the Code Seed, ensuring diverse chart outputs. Due to the numerous configuration options, we will highlight a few key aspects below:
Chart Data: Numerical data constitutes the core message conveyed by a chart. Because randomly generated data may produce excessively volatile values and unrealistic visualizations, we ensure that the generated chart data adheres to the specified ranges provided in the Code Seed. This keeps data values within the reasonable bounds defined by the chosen theme.
Numeric Labels: The numeric label within charts is a crucial configurable option. For each legend in a chart, we randomly decide whether to display numeric labels. If labels are displayed, their positions relative to the data points (above, below, etc.) are also randomly assigned. For the label font color, we randomly choose whether it should match the legend color or be assigned a different color independently.
Axis Tick Interval: Real-world chart creators often omit some labels on axes. For any continuous axis label (e.g., year, month, quarter), we set a 25% probability of no omission, a 50% probability of omitting one out of three labels, and a 25% probability of omitting two out of four labels.
Other Configurations: A multitude of detailed configurations influence chart appearance, including line width, line style (solid or dashed), bar stacking, axis visibility, font size, font type, and more. We introduce randomness into these configurations through a range of selectable options, and we employ ECharts [13] for rendering charts.
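As a simplified illustration of how such a composable generator can randomize configurations over a code seed into an ECharts-style option, consider the sketch below; the field names, the small option subset, and the probability-to-interval mapping are illustrative assumptions, not the exact configuration space used by EvoChart.

```python
import random

def compose_chart_option(seed: dict) -> dict:
    """Randomly assign a small subset of rendering configurations to a chart code seed."""
    lo, hi = seed["value_range"]          # value bounds attached to the theme in the seed
    series = []
    for legend in seed["legends"]:
        series.append({
            "name": legend,
            "type": seed["chart_type"],
            "data": [round(random.uniform(lo, hi), 1) for _ in seed["x_labels"]],
            "lineStyle": {"type": random.choice(["solid", "dashed"]),
                          "width": random.choice([1, 2, 3])},
            # Per-legend choice of whether and where to draw numeric labels.
            "label": {"show": random.random() < 0.5,
                      "position": random.choice(["top", "bottom", "inside"])},
        })
    # Axis tick omission roughly follows the 25%/50%/25% split described above
    # (the mapping to ECharts' axisLabel.interval is illustrative).
    tick_interval = random.choices([0, 1, 2], weights=[0.25, 0.5, 0.25])[0]
    return {
        "title": {"text": seed["title"]},
        "color": random.choice(seed["palettes"]),
        "xAxis": {"data": seed["x_labels"], "axisLabel": {"interval": tick_interval}},
        "yAxis": {},
        "series": series,
    }

# Example seed (hypothetical fields); the resulting option dict can be rendered with ECharts.
seed = {"chart_type": "line", "title": "Monthly Medication Cost", "legends": ["Medication"],
        "x_labels": ["Jan", "Feb", "Mar"], "value_range": (10, 100),
        "palettes": [["#5470c6", "#91cc75", "#fac858"]]}
print(compose_chart_option(seed))
```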
3.2 Chart Evaluation and Refinement
Chart Evaluation and Refinement enhances the chart images generated by the Compositional Chart Generation process. While chart code seeds can produce diverse charts, refinement remains crucial for the following reasons: 1) Random generation can lead to visually similar charts, causing overfitting and reducing the model’s generalization ability. 2) Seed-based chart construction may result in poor chart aesthetics, negatively impacting data quality. To address these issues, we propose two steps: the Chart Evaluator and the Action Space. In the $i$-th stage of data synthesis (Stage-$i$), the Chart Evaluator assigns a multi-dimensional evaluation score $S_i$ to the charts. Based on the difference between $S_i$ and the previous score $S_{i-1}$, the Action Space selects an action to modify the charts. The detailed process is described below.
The Chart Evaluator uses the current stage model to assess chart quality, producing a multi-dimensional evaluation score. To avoid hallucinations from questions like “Does this chart have flaws?”, we assess quality using a directness-based question-answering approach. The Action Space then selects actions to refine the charts based on the evaluation scores. A detailed list of action types is in the Appendix. The evaluation questions and actions are as follows:
Is-Chart & Is-Title-Clear: These questions check if the chart is correctly rendered. While existing VLMs struggle to comprehend charts in detail, they can still distinguish the names of different chart types. Therefore, we propose the following questions. For example, “Is the image a horizontal bar chart?” If the model answers incorrectly, the action is “Drop.” If correct, the action is “None.”
Label-Value & Value-Label: This question type evaluates chart quality by examining text-value alignment, for example, “What is the value of Medication in May?” We generate 10 questions per chart and calculate the average accuracy. If the change in accuracy relative to the previous stage is significantly positive, the chart may be too simple, prompting a “value enhancement method.” If it is significantly negative, this may indicate errors or overlaps, prompting the “Drop” action.
Label-Visual & Visual-Label: This question type evaluates visual-text alignment, for example, “What is the bar color of Medication?” We generate 10 questions per chart and calculate the average accuracy. If the change relative to the previous stage is significantly positive, the chart’s visual information may be too simple, prompting a “visual enhancement method.” If it is significantly negative, this may indicate visual errors, prompting the “Drop” action.
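A minimal sketch of how these evaluation scores could be mapped to refinement actions is given below; the 0.2 threshold for a “significant” change and the score dictionary layout are illustrative assumptions, while the action names follow Table 8 in the Appendix.

```python
import random

# Action lists follow Table 8; the 0.2 threshold for "significant" change is illustrative.
VALUE_ENHANCEMENTS = ["Rand Num", "More Legends", "Change Num-Scale"]
VISUAL_ENHANCEMENTS = ["Shuffle Color", "Change Axis-Scale",
                       "Change Color Schemes", "Switch Legend Position"]

def select_actions(curr: dict, prev: dict, is_chart_ok: bool, delta: float = 0.2):
    """Map current vs. previous evaluation scores to refinement actions for one chart."""
    if not is_chart_ok:                                   # Is-Chart / Is-Title-Clear failed
        return "Drop"
    if prev["value"] - curr["value"] > delta or prev["visual"] - curr["visual"] > delta:
        return "Drop"                                     # likely rendering errors or overlaps
    actions = []
    if curr["value"] - prev["value"] > delta:             # text-value questions too easy
        actions.append("VaEM: " + random.choice(VALUE_ENHANCEMENTS))
    if curr["visual"] - prev["visual"] > delta:           # visual-text questions too easy
        actions.append("ViEM: " + random.choice(VISUAL_ENHANCEMENTS))
    return actions or "None"

# Example: Label-Value accuracy jumped since the last stage, so a value enhancement is chosen.
print(select_actions({"value": 0.9, "visual": 0.5}, {"value": 0.6, "visual": 0.5}, True))
```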
Through Chart Evaluation and Refinement, we ensure that EvoChart generates accurate and challenging data relative to the current stage model in each stage. This ensures data diversity and simultaneously prevents the EvoChart model from overfitting to the EvoChart Corpus.
Table 1: Statistics of the EvoChart Corpus across the three self-training stages.

Stage | Type | Line | Bar | Pie | Scatter | Total
---|---|---|---|---|---|---
Stage1 | Chart | 9,074 | 9,772 | 4,594 | 3,878 | 27,318
Stage1 | Query | 315,902 | 203,898 | 31,533 | 90,851 | 642,184
Stage2 | Chart | 17,280 | 17,293 | 7,720 | 6,259 | 48,462
Stage2 | Query | 599,078 | 360,827 | 52,866 | 146,631 | 1,159,402
Stage3 | Chart | 24,279 | 22,562 | 11,004 | 9,316 | 67,161
Stage3 | Query | 841,745 | 470,771 | 75,356 | 218,227 | 1,606,099
3.3 QA-pair Generation and Training
QA-pair Generation and Training aims to generate question-answer pairs based on charts, incorporate these data into the EvoChart Corpus, and train the EvoChart model for the next stage. We generate question-answer pairs using various question templates. Since we focus on basic chart understanding, the templates specifically target the alignment of visual-text-value information in charts (e.g., extracting values through visual information, or extracting visual information through text). Additionally, we generate rich question-CoT pairs using composable vCoTs [26] and distinguish Direct answers from vCoT answers via the instruction. Since vCoTs solely serve to enhance model comprehension, we only mix 20% vCoT data into the training data. We have established 198 question templates with corresponding Direct answers and over 500 vCoT templates. The statistical data for the EvoChart Corpus are presented in Table 1. Detailed templates can be found in the Appendix.
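As a simple illustration of template-based QA-pair generation, the sketch below fills two of the Appendix templates from chart metadata; the metadata fields and the two-template subset are hypothetical, and the full pipeline additionally composes vCoT templates.

```python
import random

# Two of the question templates listed in the Appendix.
TEMPLATES = [
    "Can you tell me the value of {legend_label} in {xlabel}?",                        # answer: the value
    "Which legend label has a value of {value_label} at the position of {xlabel}?",    # answer: the legend
]

def make_qa_pairs(chart_meta: dict, n: int = 3):
    """Fill question templates from chart metadata and attach the ground-truth answer."""
    pairs = []
    for _ in range(n):
        legend = random.choice(list(chart_meta["data"]))
        xlabel = random.choice(list(chart_meta["data"][legend]))
        value = chart_meta["data"][legend][xlabel]
        idx = random.randrange(len(TEMPLATES))
        question = TEMPLATES[idx].format(legend_label=legend, xlabel=xlabel, value_label=value)
        answer = str(value) if idx == 0 else legend
        pairs.append((question, answer))
    return pairs

# Hypothetical chart metadata produced alongside the rendered image.
meta = {"data": {"Medication": {"May": 42, "June": 57}, "Therapy": {"May": 30, "June": 35}}}
print(make_qa_pairs(meta))
```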
4 EvoChart-QA Benchmark
EvoChart-QA is a comprehensive and challenging benchmark for real-world chart understanding. We carefully selected 650 charts with diverse appearances, all sourced from real-world websites, and then curated 1,250 chart-based understanding questions with human experts. This process ensures that EvoChart-QA accurately reflects real-world scenarios. The comparison between EvoChart-QA and other benchmarks is shown in Table 3. In the following sections, we elaborate on the chart selection process, question construction methods, and evaluation metrics used.
4.1 Chart Selection
To enable EvoChart-QA to emulate real-world chart understanding scenarios, all charts in our dataset are carefully selected by human experts. Specifically, we crawled 1,000 charts from 140 different websites. Human experts then filtered out images with ambiguous meanings or visual damage, resulting in a final dataset of 650 valid images. These images include line charts, bar charts, pie charts, and scatter plots. The distribution of these images is shown in Table 2. Examples and sources of all images are detailed in the Appendix.
Table 2: Chart and query distribution of EvoChart-QA compared with the ChartQA test set.

Dataset | Type | Line | Bar | Pie | Scatter | Total
---|---|---|---|---|---|---
ChartQA-test | Chart | 134 | 405 | 86 | - | 625
ChartQA-test | Query | 268 | 810 | 172 | - | 1,250
EvoChart-QA | Chart | 200 | 200 | 125 | 125 | 650
EvoChart-QA | Query | 400 | 400 | 225 | 225 | 1,250
4.2 Question Construction
We focus on chart basic understanding questions. Following the definitions of prior researchers [11, 24], we concentrate on data and structural retrieval questions. Specifically, we categorize the problems into two types during manual construction: Direct Retrieval and Complex Retrieval. Direct Retrieval questions focus on understanding the image and directly extracting its content, while Complex Retrieval questions emphasize performing multiple visual reasoning steps on the chart.
1) Direct Retrieval. Direct Retrieval aims to directly extract elements from a chart based on the question. To comprehensively assess the model’s ability to extract various elements from charts, we categorize chart elements into three types: label, value, and visual. Label elements refer to textual content in the chart, such as the chart title, axis labels, etc. Value elements refer to data values conveyed by the chart, which are displayed explicitly or implied, depending on the chart author’s intention. Visual elements refer to all visual descriptions in the chart, such as line color, the largest segment, etc. For example: “What is the value of the green dashed line in 2015?” Although Direct Retrieval only involves extracting elements from charts, it remains a challenging task. On the one hand, real-world charts often exhibit non-standard variations; for example, they may contain rich content such as numerous logos inserted into the image, or they may combine multiple chart forms within a single chart to aid presentation. On the other hand, questions posed by real-world users may contain ambiguous expressions. For example, “Which country’s total GDP is represented by the bar in the lightest shade of blue?” is an ambiguous visual description, yet it is a clear question for human observers.
Name | Real Data | Real Chart | Open Vocab | Human Query | Multi Source | Flex Eval |
---|---|---|---|---|---|---|
FigureQA | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
DVQA | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
PlotQA | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
ChartQA | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
CharXiv | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
Ours | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 4: Accuracy (%) on EvoChart-QA by chart type, split into Direct (Dir.) and Complex (Comp.) retrieval questions.

Model | Line Dir. | Line Comp. | Line All | Bar Dir. | Bar Comp. | Bar All | Pie Dir. | Pie Comp. | Pie All | Scatter Dir. | Scatter Comp. | Scatter All | Overall Dir. | Overall Comp. | Overall All |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||||||||
Gemini-1.5-Flash | 26.7 | 17.1 | 25.0 | 28.4 | 20.7 | 26.8 | 41.9 | 22.9 | 33.8 | 33.5 | 19.4 | 29.3 | 30.5 | 20.3 | 27.9 |
Gemini-1.5-Pro | 42.1 | 21.4 | 38.5 | 28.4 | 20.7 | 26.8 | 41.9 | 22.9 | 33.8 | 33.5 | 19.4 | 29.3 | 36.0 | 21.2 | 32.2 |
Qwen-VL-Plus | 22.7 | 11.4 | 20.8 | 28.8 | 13.8 | 25.5 | 33.3 | 9.4 | 23.1 | 27.2 | 13.4 | 23.1 | 27.0 | 11.9 | 23.1 |
Qwen-VL-Max | 35.5 | 17.1 | 32.2 | 44.4 | 23.0 | 39.8 | 48.1 | 25.0 | 38.2 | 33.5 | 19.4 | 29.3 | 39.9 | 21.6 | 35.2 |
GPT-4-turbo | 40.0 | 25.7 | 37.5 | 44.7 | 29.9 | 41.5 | 55.0 | 34.4 | 46.2 | 46.8 | 14.9 | 37.3 | 44.8 | 27.2 | 40.3 |
GPT-4o | 52.7 | 32.9 | 49.2 | 52.7 | 44.8 | 51.0 | 53.5 | 49.0 | 51.6 | 56.3 | 23.9 | 46.7 | 53.4 | 39.1 | 49.8 |
Open-source Models | |||||||||||||||
Phi3-Vision-4B | 43.3 | 27.1 | 40.5 | 47.9 | 27.6 | 43.5 | 33.3 | 27.1 | 30.7 | 50.0 | 14.9 | 39.6 | 44.6 | 24.7 | 39.5 |
QwenVL-Chat-7B | 20.6 | 17.1 | 20.0 | 18.2 | 9.2 | 16.2 | 31.0 | 14.6 | 24.0 | 24.7 | 11.9 | 20.9 | 21.9 | 13.1 | 19.7 |
LlaVa1.6-Vicuna-7B | 25.8 | 14.3 | 23.8 | 24.9 | 19.5 | 23.8 | 38.0 | 15.6 | 28.4 | 21.5 | 11.9 | 18.7 | 26.5 | 15.6 | 23.7 |
Intern-VL-2.0-8B | 38.5 | 27.1 | 36.5 | 45.7 | 29.9 | 42.2 | 43.4 | 27.1 | 36.4 | 44.9 | 22.4 | 38.2 | 42.7 | 26.9 | 38.6 |
Llama3-Next-8B | 20.3 | 5.7 | 17.8 | 22.4 | 16.1 | 21.0 | 24.0 | 21.9 | 23.1 | 20.9 | 16.4 | 19.6 | 21.6 | 15.6 | 20.1 |
CogVLM2-19B | 24.8 | 11.4 | 22.5 | 28.8 | 10.3 | 24.8 | 27.9 | 5.2 | 18.2 | 24.7 | 7.5 | 19.6 | 26.6 | 8.4 | 21.9 |
LlaVa1.6-YI-34B | 5.8 | 10.0 | 6.5 | 7.7 | 4.6 | 7.0 | 13.2 | 5.2 | 9.8 | 9.5 | 6.0 | 8.4 | 8.1 | 6.2 | 7.6 |
Intern-VL-2.0-40B | 53.3 | 42.9 | 51.5 | 54.3 | 37.9 | 50.7 | 55.8 | 37.5 | 48.0 | 51.9 | 20.9 | 42.7 | 53.8 | 35.3 | 49.0 |
Chart Expert Models | |||||||||||||||
ChartLlama-13B | 7.3 | 4.3 | 6.8 | 7.3 | 10.3 | 8.0 | 21.7 | 6.2 | 15.1 | 13.9 | 6.0 | 11.6 | 10.4 | 6.9 | 9.5 |
ChartAst-S-13B | 12.4 | 12.9 | 12.5 | 14.4 | 14.9 | 14.5 | 14.7 | 7.3 | 11.6 | 13.9 | 12.0 | 12.0 | 13.7 | 10.6 | 12.9 |
ChartIns-Llama2-7B | 17.9 | 11.4 | 16.8 | 16.0 | 19.5 | 16.8 | 27.1 | 13.5 | 21.3 | 13.9 | 9.0 | 12.4 | 17.8 | 13.8 | 16.8 |
ChartIns-FlanT5-3B | 23.6 | 24.3 | 23.8 | 28.4 | 16.1 | 25.8 | 40.3 | 19.8 | 31.6 | 13.9 | 19.4 | 15.6 | 25.9 | 19.7 | 24.3 |
ChartGemma-2B | 33.9 | 25.7 | 32.5 | 29.1 | 25.3 | 28.2 | 36.4 | 30.2 | 33.8 | 32.9 | 16.4 | 28.0 | 32.5 | 25.0 | 30.6 |
TinyChart-3B | 24.5 | 15.7 | 23.0 | 28.4 | 17.2 | 26.0 | 33.3 | 15.6 | 25.8 | 33.5 | 17.9 | 28.9 | 28.6 | 16.6 | 25.5 |
EvoChart-4B | 62.1 | 32.9 | 57.0 | 62.3 | 33.3 | 56.0 | 64.3 | 30.2 | 49.8 | 55.1 | 37.3 | 49.8 | 61.3 | 33.1 | 54.2 |
2) Complex Retrieval. Complex Retrieval involves querying information within a chart using complex, multi-step descriptions. Compared to Direct Retrieval, Complex Retrieval focuses on understanding the relative positions of elements within the chart. For example, “In the chart, the third bar to the left of the longest red bar represents the GDP of which country?”. Complex retrieval poses novel challenges for chart comprehension. This is because the descriptive information in complex retrieval relies entirely on the chart itself, requiring the model to have a comprehensive and clear understanding of the chart. For example, comprehending the previously mentioned “the longest red bar” is entirely based on the extraction of information from the chart’s content. Furthermore, this also necessitates the model to possess sophisticated visual reasoning capabilities, such as understanding “the third bar from the left” which demands visually-grounded inference.
4.3 Evaluation Metrics
We designed an automatic evaluation method for EvoChart-QA, combining flexible and strict approaches, to fairly evaluate answer correctness. During EvoChart-QA construction, we label each question as “Strict” or “Flex” and use the corresponding approach to evaluate correctness. For “Strict” questions, answers have a definite value, such as numerical or textual values explicitly labeled in the chart; we employ a zero-tolerance criterion when judging these questions. For “Flex” questions, answers are estimated values, such as unlabeled numerical values; we employ a 5% tolerance criterion when judging these questions. Finally, we report average accuracy to evaluate the model’s performance. In contrast, previous metrics such as ChartQA’s allowed a 5% tolerance for any numerical answer, leading to an optimistic estimation of model outputs. For example, years are numerical answers, so 1995 and 2008 would fall within the 5% tolerance under previous evaluation metrics, whereas our method does not exhibit this error.
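A minimal sketch of this combined Strict/Flex scoring is shown below, assuming each benchmark item carries a (prediction, gold answer, mode) triple; the normalization details are illustrative.

```python
def is_correct(pred: str, gold: str, mode: str) -> bool:
    """Strict: exact match after normalization. Flex: 5% relative tolerance for numeric estimates."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if mode == "Strict":
        return pred == gold               # zero tolerance, e.g. values labeled in the chart or years
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred == gold               # non-numeric Flex answers fall back to exact match
    return abs(p - g) <= 0.05 * abs(g)

def average_accuracy(samples):
    """samples: iterable of (prediction, gold, mode) triples."""
    samples = list(samples)
    return sum(is_correct(p, g, m) for p, g, m in samples) / len(samples)

# A year question is labeled Strict, so 1995 is not accepted for 2008 under this metric.
print(is_correct("1995", "2008", "Strict"))   # False
print(is_correct("102", "100", "Flex"))       # True (within 5%)
```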
5 Experiments
5.1 Setup
Datasets. To comprehensively evaluate the effectiveness of EvoChart, we chose to test it on both ChartQA and EvoChart-QA. ChartQA [20] is a dataset with two subsets: “Augment”, which is machine-generated, and “Human”, which is manually curated. “Augment” focuses on element extraction tasks within machine-synthesized images, while “Human” emphasizes complex numerical and logical reasoning tasks in real-world charts. EvoChart-QA is a novel real-world benchmark that we proposed.
Models. We conducted extensive evaluations on both open-source and proprietary models. For open-source models, we tested Phi3-Vision-4B [1], QwenVL-Chat-7B [3], LlaVa1.6-Vicuna-7B [17], Intern-VL-2.0-8B [5], Llama3-Llava-Next-8B [14], CogVLM2-19B [29], LlaVa1.6-YI-34B, Intern-VL-2.0-40B, ChartLlama-13B [9], ChartAst-S-13B [23], ChartIns-Llama2-7B [21], ChartIns-FlanT5-3B, ChartGemma-2B [22], and TinyChart-3B [35]. For proprietary models, we tested Gemini-1.5-Flash [27], Gemini-1.5-Pro, Qwen-VL-Plus, Qwen-VL-Max, GPT-4-turbo [25], and GPT-4o. For all models, we employed a zero-shot approach. The specific configurations of all models are provided in the Appendix.
Settings. In the EvoChart method, we utilize Phi3-Vision [1] as the initialization model. We conducted a 3-stage data synthesis and training process, with each stage undergoing one epoch of SFT with a learning rate of 2e-5 and a cosine learning rate scheduler. All experiments were completed on 4 NVIDIA A800 80G GPUs.
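For reference, a sketch of the per-stage fine-tuning configuration implied by these settings, expressed with Hugging Face transformers TrainingArguments; the output path and batch size are placeholders, as they are not specified above.

```python
from transformers import TrainingArguments

# One epoch of SFT per stage with lr 2e-5 and a cosine schedule, as described above.
stage_args = TrainingArguments(
    output_dir="evochart-stage1",        # placeholder path
    num_train_epochs=1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,       # illustrative; not specified in the paper
    bf16=True,                           # illustrative mixed-precision choice for A800 GPUs
)
```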
5.2 Experimental Results
Tables 4 and 5 present the performance of EvoChart and other open-source or proprietary VLMs on EvoChart-QA and ChartQA. We elaborate on the experimental results in two aspects: EvoChart-QA and EvoChart.
EvoChart-QA Results. All models exhibit relatively poor performance on EvoChart-QA, with accuracies not exceeding 55%. Among proprietary models, GPT-4o achieves the highest accuracy at 49.8%. InternVL-2.0-40B demonstrates the strongest performance among open-source general-purpose models, reaching 49.0%. Within the domain of chart-expert models, our proposed EvoChart method yields the best-performing model, achieving the highest accuracy across all models at 54.2%. We summarize our findings as follows:
1) EvoChart-QA presents a substantially more challenging benchmark for evaluating basic chart comprehension. Even without involving numerical reasoning or calculation, these models experience a significant performance drop of 30-50% on EvoChart-QA compared to their scores on ChartQA. This challenge stems from the diversity of charts sourced from 140 websites and the meticulously crafted questions that comprise our dataset.
2) All models demonstrate significantly weaker performance on Complex Retrieval compared to Direct Retrieval, indicating that reasoning over visual information poses a substantially greater challenge than direct extraction of information from charts. Furthermore, nearly all models exhibit lower accuracy on Pie and Scatter chart types compared to their average performance. This suggests that Pie and Scatter charts pose a greater challenge for these models to understand.
3) Open-source general-purpose models exhibit comparable chart comprehension abilities to proprietary models. This suggests that models pretrained on large-scale chart data possess strong generalization capabilities. However, while chart expert models fine-tuned on specific domains achieve impressive scores on the ChartQA dataset, their performance degrades significantly to below 30% when confronted with the entirely out-of-distribution (OOD) EvoChart-QA dataset.
Model | ChartQA | EvoChart-QA |
---|---|---|
Gemini-1.5-pro | 81.3 | 32.2 |
GPT-4-turbo | 62.3 | 40.3 |
GPT-4o | 85.7 | 49.8 |
CogVLM2-19B | 81.0 | 21.9 |
Phi3-Vision-4B | 81.4 | 38.6 |
Intern-VL-2.0-8B | 81.5 | 49.0 |
ChartAst-S-13B | 79.9 | 12.9 |
TinyChart-3B | 83.6 | 25.5 |
EvoChart-4B | 81.5 | 54.2 |

EvoChart Results. Among all proprietary and open-source models evaluated, our EvoChart-trained model exhibits significantly superior performance, achieving an accuracy of 54.2% and surpassing GPT-4o’s 49.8%. We have the following observations:
1) Although the EvoChart model is trained on synthetic data, it achieves state-of-the-art performance on the entirely OOD real-world benchmark EvoChart-QA and exhibits competitive performance on the chart reasoning task ChartQA. This validates the strong generalization ability of the EvoChart approach.
2) EvoChart primarily focuses on chart basic comprehension. However, as demonstrated in Table 5, EvoChart remains one of the top-performing chart expert models on ChartQA. This is an intriguing finding, suggesting that basic chart comprehension serves as a cornerstone for chart reasoning tasks, and training on basic comprehension can enhance performance in chart reasoning tasks.
3) EvoChart’s complex retrieval capabilities are inferior to those of InternVL-2.0-40B and GPT-4o. This is reasonable, as these models possess significantly larger scales, which confer an inherent advantage in complex visual extraction and reasoning tasks.
5.3 Analysis
EvoChart Ablation Study. We conducted comprehensive ablation studies on EvoChart, and the results are summarized in Table 6. We trained and generated EvoChart for 1 to 3 stages. Meanwhile, to verify the effectiveness of Chart Evaluation and Refinement, we set up EvoChart without refinement and trained it for 3 stages, denoted as “w/o refine stage-3”. Based on the experimental results presented in Table 6 and Table 1, we observed the following:
1) As the number of EvoChart stages increases, the scale of the EvoChart Corpus expands, and the performance of the EvoChart model steadily improves. This demonstrates the effectiveness of EvoChart as a multi-round self-training approach.
2) Despite having access to a larger training dataset, the “w/o refine stage-3” variant exhibits significantly lower performance on EvoChart-QA compared to the complete EvoChart method. This indicates the effectiveness of Chart Evaluation and Refinement in enhancing the model’s generalization ability within EvoChart.
Model | Line | Bar | Pie | Scatter | Overall |
---|---|---|---|---|---|
w/ refine stage-1 | 53.2 | 51.5 | 46.7 | 44.9 | 50.0 |
w/ refine stage-2 | 53.5 | 54.2 | 49.8 | 48.0 | 52.0 |
w/ refine stage-3 | 57.0 | 56.0 | 49.8 | 49.8 | 54.2 |
w/o refine stage-3 | 52.5 | 50.7 | 43.6 | 45.3 | 49.0 |
Case Study. To further analyze EvoChart and EvoChart-QA, we selected samples from the EvoChart-QA benchmark for analysis. Figure 3 presents four cases; more cases are provided in the Appendix. As shown in the figure, EvoChart-QA offers diverse charts and questions, and EvoChart achieves more accurate chart understanding than GPT-4o. Q2 and Q4 demonstrate the effectiveness of our Strict/Flex metrics: for values explicitly labeled in the image, there should be zero tolerance, whereas for estimation questions like Q1, a 5% tolerance is allowed. Furthermore, Q2 highlights EvoChart’s capability for precise chart understanding in complex scenarios.
6 Conclusion
In this paper, we introduce EvoChart and EvoChart-QA: a novel approach for enhancing chart comprehension capabilities through self-training and iterative synthetic data generation, and a meticulously crafted real-world chart comprehension benchmark. We aim to provide a new avenue for real-world chart understanding through EvoChart and EvoChart-QA. Through extensive evaluation, we expose the limitations of existing VLMs in chart comprehension and validate the effectiveness of our EvoChart method across multiple datasets. In the future, we will continue to explore the integration of human-effort-free self-training methods with chart comprehension to further advance the field.

7 Appendix
7.1 Modified ChartQA
1) Result of Modified ChartQA.
The experimental results on Modified ChartQA using several VLMs, as mentioned in the Introduction, are presented in Table 7. Even the best-performing model exhibits a performance drop of nearly 30%.
Table 7: ChartQA average accuracy versus accuracy on Modified ChartQA (Drop = absolute decrease).

Model | ChartQA-Avg | ChartQA-Modified | Drop
---|---|---|---
Gemini-1.5-pro | 81.3 | 18.1 | 63.2
GPT-4o | 85.7 | 45.8 | 39.9
Phi-3-Vision | 81.4 | 52.6 | 28.8
TinyChart | 83.6 | 45.8 | 37.8
2) Case of Modified ChartQA.
7.2 EvoChart-QA Detailed Case
An overview of the EvoChart-QA benchmark is shown in Figure 4. Compared to CharXiv, PlotQA, DVQA, and ChartQA, our proposed EvoChart-QA benchmark has 140 sources and 1,250 manually annotated questions, providing a more realistic evaluation benchmark. This broader scope and meticulous annotation contribute to a more comprehensive and robust assessment of chart comprehension capabilities. Figures 5, 6, and 7 present more detailed examples from the EvoChart-QA dataset.





7.3 EvoChart Action Space Types
Table 8 presents the categories of actions within EvoChart. The three types correspond to the question types mentioned in the main text. Specifically, VaEM represents Value Enhancement Method, and ViEM represents Visual Enhancement Method. For each question, there is a possibility of dropping its corresponding chart. For questions where VaEM or ViEM is selected, a random action will be chosen to modify the corresponding chart.
Detailed Experiment Settings
For all open-source general-purpose VLMs, proprietary VLMs, and chart expert models, we employ a zero-shot prompting approach.
For open-source general-purpose and proprietary models, we use the following prompt: “You will play as a chart reading expert. You should ONLY give the answer STRING or NUMBER, without any units. You should Not Give Any Explanation.” This is because general-purpose models have undergone extensive instruction fine-tuning and alignment with human preferences, thus requiring a more detailed prompt to regulate their output.
For chart expert models, we utilize their respective training instructions. For the EvoChart model, the prompt is as follows: “You will play as a chart reading expert. You should just give the answer, without any explanation or units.” This is because chart expert models have been fine-tuned with specific instructions during their training process.
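For illustration, the zero-shot request for one chart-question pair can be assembled as follows; the message schema is a generic, model-agnostic sketch, not the API format of any particular VLM.

```python
# System prompts quoted above; the message format below is an illustrative sketch.
GENERAL_PROMPT = (
    "You will play as a chart reading expert. You should ONLY give the answer STRING or NUMBER, "
    "without any units. You should Not Give Any Explanation."
)
EVOCHART_PROMPT = (
    "You will play as a chart reading expert. You should just give the answer, "
    "without any explanation or units."
)

def build_messages(question: str, image_path: str, evochart_model: bool = False):
    """Compose one zero-shot request for an EvoChart-QA item."""
    system = EVOCHART_PROMPT if evochart_model else GENERAL_PROMPT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": [{"type": "image", "path": image_path},
                                     {"type": "text", "text": question}]},
    ]

print(build_messages("What is the value of Medication in May?", "charts/example.png"))
```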
Table 8: Action space of EvoChart (VaEM = Value Enhancement Method, ViEM = Visual Enhancement Method).

Type | Actions
---|---
Is-Chart & Is-Title-Clear | Drop; None
Label-Value & Value-Label | Drop; None; VaEM: Rand Num; VaEM: More Legends; VaEM: Change Num-Scale
Label-Visual & Visual-Label | Drop; None; ViEM: Shuffle Color; ViEM: Change Axis-Scale; ViEM: Change Color Schemes; ViEM: Switch Legend Position
7.4 EvoChart Method Question Template
We have established 284 distinct QA-Pair Templates, as outlined below:
-
•
Can you tell me the value of {legend_label} in {xlabel}?
-
•
I’d like to know the value of {legend_label} within {xlabel}.
-
•
Could you provide the value of {legend_label} found in {xlabel}?
-
•
What amount does {legend_label} have in {xlabel}?
-
•
Please specify the value of {legend_label} in the context of {xlabel}.
-
•
What is the value of {legend_label} in {xlabel} ?
-
•
Can you identify the legend label with a value of {value_label} at the position marked by {xlabel}?
-
•
What legend label shows a value of {value_label} at the point {xlabel}?
-
•
Could you tell me which legend label corresponds to the value {value_label} at {xlabel}?
-
•
Which label in the legend has the value {value_label} at the {xlabel} position?
-
•
Identify the legend label with a value of {value_label} at {xlabel}, please.
-
•
Which legend label has a value of {value_label} at the position of {xlabel} ?
-
•
List the values at {xlabel} from bottom to top.
-
•
Give me the values at {xlabel} arranged in a list from bottom to top.
-
•
Could you provide the values at {xlabel} in a list, starting from the bottom and going to the top?
-
•
Please provide a list of the values at {xlabel} in order from bottom to top.
-
•
I’d like the values at {xlabel} in a list format, ordered from bottom to top.
-
•
Provide the values at {xlabel} in a list format from bottom to top.
-
•
What are the data values in ascending order on the x-axis tick right before {xlabel}?
-
•
Can you list the data values from smallest to largest on the x-axis tick just to the left of {xlabel}?
-
•
Please provide the data values sorted from smallest to largest for the x-axis tick immediately preceding {xlabel}.
-
•
Could you tell me the data values from smallest to largest at the x-axis tick just before {xlabel}?
-
•
What are the data values, ordered from smallest to largest, on the x-axis tick directly left of {xlabel}?
-
•
On the x-axis tick immediately to the left of {xlabel}, what are the data values from smallest to largest?
-
•
What legend label features a {line_color} {line_style} line?
-
•
Which legend key shows a {line_color} {line_style} line?
-
•
Can you identify the legend label with a {line_color} {line_style} line?
-
•
Which label in the legend corresponds to a {line_color} {line_style} line?
-
•
Could you tell me which legend label has a {line_color} {line_style} line?
-
•
Which legend label has a {line_color} {line_style} line?
-
•
Can you tell me the line style for {legend_label}?
-
•
What kind of line style does {legend_label} use?
-
•
Please specify the line style associated with {legend_label}.
-
•
How is the line style defined for {legend_label}?
-
•
What’s the type of line style for {legend_label}?
-
•
What is the line style of {legend_label}?
-
•
Can you tell me the value of the {n}th data point from the left on the {line_color} line that is {line_style}?
-
•
What is the value of the {n}th point from the left on the {line_style} line that is colored {line_color}?
-
•
Please provide the value of the {n}th data point from the left on the {line_color} {line_style} line.
-
•
Could you specify the value of the {n}th point from the left on the {line_style} line in {line_color}?
-
•
What value does the {n}th point from the left have on the {line_color} line with {line_style}?
-
•
What is the value of the {n}th data point from left to right on the {line_style} line of {line_color} color?
-
•
When the line labeled {legend_label} hits the {value_label} mark at {xlabel}, how many lines is it positioned above?
-
•
At {xlabel}, when the {legend_label} line reaches the {value_label} level, how many other lines is it above?
-
•
How many lines are beneath the {legend_label} line when it reaches {value_label} at {xlabel}?
-
•
When {legend_label} hits the value {value_label} at {xlabel}, how many lines are below it?
-
•
How many lines does the line labeled {legend_label} surpass at {xlabel} when it attains the {value_label} value?
-
•
When the line represented by {legend_label} reaches the value {value_label} at {xlabel}, how many lines is this line above?
-
•
At which x-label does the line represented by {legend_label} reach its highest point?
-
•
Where does the line indicated by {legend_label} peak on the x-axis?
-
•
Can you identify the x-label where the line for {legend_label} reaches its maximum value?
-
•
Which x-label corresponds to the highest point of the line denoted by {legend_label}?
-
•
At what x-label is the peak of the line marked by {legend_label}?
-
•
The highest point of the line represented by {legend_label} is at which x-label?
-
•
Where on the x-axis does the line labeled {legend_label} dip to its lowest value?
-
•
Can you tell me the x-label where the {legend_label} line hits its minimum?
-
•
At which point on the x-axis does the {legend_label} line bottom out?
-
•
What’s the x-label where the line for {legend_label} reaches its lowest point?
-
•
Where along the x-axis does the {legend_label} line find its lowest value?
-
•
At which x-label does the line represented by {legend_label} reach its lowest point?
-
•
At which x-label does the line with color {line_color} and style {line_style} reach its lowest point?
-
•
Where along the x-axis does the {line_color} line, with its {line_style} style, hit the bottom?
-
•
Can you tell me the x-label where the {line_color} line, designed in {line_style}, drops to its lowest?
-
•
Which x-label marks the lowest point of the {line_color} line that’s styled as {line_style}?
-
•
What’s the x-label where the {line_color} line, styled in {line_style}, touches its lowest point?
-
•
Which xlabel does the {line_color} line, styled as {line_style}, reach its bottom?
-
•
At which xlabel does the {line_color} colored line with {line_style} reach its peak?
-
•
Can you tell me the value of {legend_label} in {xlabel}?
-
•
I’d like to know the value of {legend_label} within {xlabel}.
-
•
Could you provide the value of {legend_label} found in {xlabel}?
-
•
What amount does {legend_label} have in {xlabel}?
-
•
Please specify the value of {legend_label} in the context of {xlabel}.
-
•
What is the value of {legend_label} in {xlabel} ?
-
•
What is the legend label for the value {value_label} on the {xlabel} axis?
-
•
Identify the legend label that corresponds to the value {value_label} in {xlabel}.
-
•
Which legend label matches the value {value_label} in {xlabel}?
-
•
Find the legend label associated with the value {value_label} in {xlabel}.
-
•
Can you tell me the legend label that has the value {value_label} in {xlabel}?
-
•
Which legend label have value of {value_label} in {xlabel} ?
-
•
Which legend label has a value of {value_label} at the position of {xlabel}?
-
•
At the position of {xlabel}, which legend label corresponds to the value {value_label}?
-
•
Identify the legend label that has a value of {value_label} at the {xlabel} position.
-
•
What legend label holds the value {value_label} at the position indicated by {xlabel}?
-
•
Determine the legend label with a value of {value_label} at the {xlabel} location.
-
•
Which legend label shows a value of {value_label} at the position marked by {xlabel}?
-
•
Provide the values at {xlabel} in a list format rating from small to large.
-
•
Can you list the values at {xlabel} from small to large?
-
•
Please arrange the values at {xlabel} in a list from small to large.
-
•
List the values at {xlabel} in ascending order.
-
•
I’d like to see the values at {xlabel} rated from small to large in a list.
-
•
Can you provide a list of the values at {xlabel} from the smallest to the largest?
-
•
Could you please give the values at {xlabel} in a list format, ordered from small to large?
-
•
What is the value of the {n}th data point from bottom to top on the {border_type} border bar of {line_color} color?
-
•
Could you tell me the value of the {n}th data point from the bottom to top on the {border_type} border bar in {line_color}?
-
•
Please provide the value of the {n}th data point from bottom to top on the {line_color} {border_type} border bar.
-
•
What’s the value of the {n}th data point from bottom to top on the {line_color} {border_type} border bar?
-
•
I need the value of the {n}th data point from the bottom to the top on the {border_type} border bar of {line_color}.
-
•
Can you find the value of the {n}th data point from bottom to top on the {line_color} border bar of {border_type} type?
-
•
Please tell me the value of the {n}th data point from bottom to top on the {border_type} border bar that is {line_color}.
-
•
What is the value of the longest {line_color} bar?
-
•
Can you tell me the value of the tallest {line_color} bar?
-
•
I need to know the value of the highest {line_color} bar.
-
•
What is the value of the {line_color} bar with the maximum height?
-
•
Please provide the value of the largest {line_color} bar.
-
•
Could you find out the value of the highest {line_color} bar?
-
•
What is the value of the shortest {line_color} bar?
-
•
Can you tell me the value of the smallest {line_color} bar?
-
•
I need to know the value of the least tall {line_color} bar.
-
•
What is the value of the {line_color} bar with the minimum height?
-
•
Please provide the value of the {line_color} bar that is the shortest.
-
•
Could you find out the value of the shortest {line_color} bar for me?
-
•
Can you tell me the value of {legend_label} in {xlabel}?
-
•
I’d like to know the value of {legend_label} within {xlabel}.
-
•
Could you provide the value of {legend_label} found in {xlabel}?
-
•
What amount does {legend_label} have in {xlabel}?
-
•
Please specify the value of {legend_label} in the context of {xlabel}.
-
•
What is the value of {legend_label} in {xlabel} ?
-
•
What is the legend label for the value {value_label} on the {xlabel} axis?
-
•
Identify the legend label that corresponds to the value {value_label} in {xlabel}.
-
•
Which legend label matches the value {value_label} in {xlabel}?
-
•
Find the legend label associated with the value {value_label} in {xlabel}.
-
•
Can you tell me the legend label that has the value {value_label} in {xlabel}?
-
•
Which legend label have value of {value_label} in {xlabel} ?
-
•
Which legend label has a value of {value_label} at the position of {xlabel}?
-
•
At the position of {xlabel}, which legend label corresponds to the value {value_label}?
-
•
Identify the legend label that has a value of {value_label} at the {xlabel} position.
-
•
What legend label holds the value {value_label} at the position indicated by {xlabel}?
-
•
Determine the legend label with a value of {value_label} at the {xlabel} location.
-
•
Which legend label shows a value of {value_label} at the position marked by {xlabel}?
-
•
Provide the values at {xlabel} in a list format rating from small to large.
-
•
Can you list the values at {xlabel} from small to large?
-
•
Please arrange the values at {xlabel} in a list from small to large.
-
•
List the values at {xlabel} in ascending order.
-
•
I’d like to see the values at {xlabel} rated from small to large in a list.
-
•
Can you provide a list of the values at {xlabel} from the smallest to the largest?
-
•
Could you please give the values at {xlabel} in a list format, ordered from small to large?
-
•
What is the value of the {n}th data point from left on the {border_type} border bar of {line_color} color?
-
•
Could you tell me the value of the {n}th data point from left on the {border_type} border bar in {line_color}?
-
•
Please provide the value of the {n}th data point from left on the {line_color} {border_type} border bar.
-
•
What’s the value of the {n}th data point from left on the {line_color} {border_type} border bar?
-
•
I need the value of the {n}th data point from left on the {border_type} border bar of {line_color}.
-
•
Can you find the value of the {n}th data point from left on the {line_color} border bar of {border_type} type?
-
•
Please tell me the value of the {n}th data point from left on the {border_type} border bar that is {line_color}.
-
•
What is the value of the longest {line_color} bar?
-
•
Can you tell me the value of the tallest {line_color} bar?
-
•
I need to know the value of the highest {line_color} bar.
-
•
What is the value of the {line_color} bar with the maximum height?
-
•
Please provide the value of the largest {line_color} bar.
-
•
Could you find out the value of the highest {line_color} bar?
-
•
What is the value of the shortest {line_color} bar?
-
•
Can you tell me the value of the smallest {line_color} bar?
-
•
I need to know the value of the least tall {line_color} bar.
-
•
What is the value of the {line_color} bar with the minimum height?
-
•
Please provide the value of the {line_color} bar that is the shortest.
-
•
Could you find out the value of the shortest {line_color} bar for me?
-
•
How many sectors are there in total in this pie chart?
-
•
How many segments are there in total in this pie chart?
-
•
What is the total number of sectors in this pie chart?
-
•
How many segments in total are present in this pie chart?
-
•
What is the total count of sectors in this pie chart?
-
•
How many segments does this pie chart have in total?
-
•
What is the percentage of ’sector_label’ in ’series_label’ on this chart?
-
•
How much percent does ’sector_label’ make up in ’series_label’ on this chart?
-
•
Can you tell me the percentage of ’sector_label’ within ’series_label’ in this chart?
-
•
What proportion of ’series_label’ does ’sector_label’ represent in this chart?
-
•
How large is the percentage of ’sector_label’ in the ’series_label’ shown on this chart?
-
•
In this chart, what percentage does ’sector_label’ constitute in ’series_label’?
-
•
What is the number of ’sector_label’ in ’series_label’ on this chart?
-
•
How many ’sector_label’ are there in ’series_label’ on this chart?
-
•
Can you find the number of ’sector_label’ within ’series_label’ in this chart?
-
•
What count of ’sector_label’ does ’series_label’ have in this chart?
-
•
How many instances of ’sector_label’ are in ’series_label’ on this chart?
-
•
In this chart, what is the count of ’sector_label’ in ’series_label’?
-
•
What percentage of ’series_label’ is made up by ’sector_label’ in this chart?
-
•
What is the proportion of ’sector_label’ in ’series_label’ on this chart?
-
•
Could you specify the fraction of ’sector_label’ within ’series_label’ depicted in this chart?
-
•
What ratio does ’sector_label’ contribute to ’series_label’ as shown in this chart?
-
•
How large is the share of ’sector_label’ in ’series_label’ on this chart?
-
•
In this chart, what part of series_label does ’sector_label’ represent?
-
•
What is the percentage of the sector with the {name_color} color?
-
•
How much percent does the sector with the {name_color} color represent?
-
•
Can you tell me the proportion of the sector with the {name_color} color?
-
•
What fraction does the sector with the {name_color} color make up?
-
•
How large is the percentage of the sector with the {name_color} color?
-
•
In this chart, what percentage does the sector with the {name_color} color constitute?
-
•
What is the number of the sector with the {name_color} color?
-
•
How many sectors are there with the {name_color} color?
-
•
Can you tell me the count of the sector with the {name_color} color?
-
•
What is the quantity of the sector with the {name_color} color?
-
•
How many sectors are labeled with the {name_color} color?
-
•
In this chart, what is the number of sectors with the {name_color} color?
-
•
What is the percentage of the sector with the {name_color} color?
-
•
How much of the sector is represented by the {name_color} color in percentage?
-
•
Can you tell me the percentage of the sector that is {name_color}?
-
•
What fraction of the sector is the {name_color} color?
-
•
How large is the share of the sector with the {name_color} color?
-
•
What portion of the sector does the {name_color} color represent in percentage?
-
•
What is the percentage of the largest sector in ’series_label’ in the pie chart?
-
•
In the pie chart, what percentage does the largest sector in ’series_label’ represent?
-
•
• What proportion does the largest sector in ’series_label’ hold in the pie chart?
• How much percentage does the largest sector in ’series_label’ account for in the pie chart?
• Can you tell me the percentage of the largest sector in ’series_label’ on the pie chart?
• What is the share of the largest sector in ’series_label’ in the pie chart?
• What percentage of the pie chart does the smallest sector in ’series_label’ occupy?
• In ’series_label’, what is the percentage of the smallest sector in the pie chart?
• What is the fractional representation of the smallest sector in ’series_label’ within the pie chart?
• How much does the smallest sector in ’series_label’ contribute to the pie chart as a percentage?
• What is the smallest sector’s percentage in the pie chart under ’series_label’?
• Within ’series_label’, what is the percentage value of the smallest sector in the pie chart?
• Can you tell me the {y_axis_topic} of {x_value} {x_axis_topic} in {legend_name}?
• What is the {y_axis_topic} of {x_value} {x_axis_topic} in {legend_name}?
• Please provide the {y_axis_topic} for {x_value} {x_axis_topic} in {legend_name}.
• Could you tell me the {y_axis_topic} for {x_value} {x_axis_topic} in {legend_name}?
• What {y_axis_topic} corresponds to {x_value} {x_axis_topic} in {legend_name}?
• Can you provide the {y_axis_topic} of {x_value} {x_axis_topic} in {legend_name}?
• Could you provide the {x_axis_topic} for a {y_value} {y_axis_topic} in {legend_name}?
• What is the {x_axis_topic} for a {y_value} {y_axis_topic} in {legend_name}?
• Can you give the {x_axis_topic} for a {y_value} {y_axis_topic} in {legend_name}?
• Please provide the {x_axis_topic} corresponding to a {y_value} {y_axis_topic} in {legend_name}.
• Could you tell me the {x_axis_topic} for a {y_value} {y_axis_topic} in {legend_name}?
• What {x_axis_topic} corresponds to a {y_value} {y_axis_topic} in {legend_name}?
• How many legend labels are there in the chart?
• What is the number of legend labels in the chart?
• In the chart, how many legend labels are present?
• How many labels are there in the chart’s legend?
• What count of legend labels is shown in the chart?
• Can you tell how many legend labels are included in the chart?
• How many different colors of data points are there in the chart?
• What is the number of different colors of data points in the chart?
• In the chart, how many different colors of data points can be observed?
• How many unique colors of data points are present in the chart?
• What count of different colored data points is shown in the chart?
• Can you tell how many different colors of data points are in the chart?
• What is the {x_axis_topic} value of the {legend_name} at the peak {y_axis_topic} in this chart?
• What {x_axis_topic} value corresponds to the peak {y_axis_topic} for the {legend_name} in this chart?
• In this chart, what is the {x_axis_topic} value when {legend_name} reaches the peak {y_axis_topic}?
• At the peak {y_axis_topic} for {legend_name} in this chart, what is the {x_axis_topic} value?
• What is the {x_axis_topic} value when {legend_name} has the peak {y_axis_topic} in this chart?
• In this chart, what {x_axis_topic} value aligns with the peak {y_axis_topic} of {legend_name}?
• What is the corresponding {x_axis_topic} value when {legend_name} reaches its highest {y_axis_topic}?
• When {legend_name} reaches its highest {y_axis_topic}, what is the corresponding {x_axis_topic} value?
• What {x_axis_topic} value corresponds to the highest {y_axis_topic} for {legend_name}?
• When {legend_name} has its highest {y_axis_topic}, what is the corresponding {x_axis_topic} value?
• What is the {x_axis_topic} value at the highest {y_axis_topic} for {legend_name}?
• When {legend_name} hits its highest {y_axis_topic}, what is the corresponding {x_axis_topic} value?
• When {legend_name} reaches its highest point, what is the corresponding {x_axis_topic} value?
• When {legend_name} is at its highest point, what is the corresponding {x_axis_topic} value?
• What is the {x_axis_topic} value when {legend_name} reaches its highest point?
• At the highest point of {legend_name}, what is the corresponding {x_axis_topic} value?
• What {x_axis_topic} value corresponds to the highest point of {legend_name}?
• When {legend_name} reaches its peak, what is the corresponding {x_axis_topic} value?
• What is the {x_axis_topic} value of the {legend_name} with the lowest {y_axis_topic} in this chart?
• In this chart, what {x_axis_topic} value corresponds to the {legend_name} with the minimum {y_axis_topic}?
• For the {legend_name} with the lowest {y_axis_topic} in this chart, what is the {x_axis_topic} value?
• What {x_axis_topic} value is associated with the {legend_name} that has the lowest {y_axis_topic} in this chart?
• Which {x_axis_topic} value corresponds to the lowest {y_axis_topic} for the {legend_name} in this chart?
• In this chart, what is the {x_axis_topic} value for the {legend_name} with the smallest {y_axis_topic}?
• What is the corresponding {x_axis_topic} value when {legend_name} reaches its lowest {y_axis_topic}?
• When {legend_name} has its lowest {y_axis_topic}, what is the corresponding {x_axis_topic} value?
• What {x_axis_topic} value corresponds to the lowest {y_axis_topic} for {legend_name}?
• For {legend_name} at its minimum {y_axis_topic}, what is the corresponding {x_axis_topic} value?
• Which {x_axis_topic} value aligns with {legend_name} when it has the lowest {y_axis_topic}?
• When the {y_axis_topic} is at its lowest for {legend_name}, what is the corresponding {x_axis_topic} value?
• When {legend_name} reaches its lowest point, what is the corresponding {x_axis_topic} value?
• What is the {x_axis_topic} value when {legend_name} hits its lowest point?
• At the lowest point of {legend_name}, what is the corresponding {x_axis_topic} value?
• What {x_axis_topic} value corresponds to the lowest point of {legend_name}?
• When {legend_name} is at its lowest, what is the {x_axis_topic} value?
• What {x_axis_topic} value aligns with the lowest point of {legend_name}?
• Which legend has a {y_axis_topic} equal to {y_value} when the {x_axis_topic} is {x_value}?
• Which legend shows a {y_axis_topic} of {y_value} when the {x_axis_topic} equals {x_value}?
• When the {x_axis_topic} is {x_value}, which legend corresponds to a {y_axis_topic} of {y_value}?
• At {x_value} on the {x_axis_topic}, which legend has a {y_axis_topic} value of {y_value}?
• What legend’s {y_axis_topic} is {y_value} when the {x_axis_topic} reads {x_value}?
• When {x_value} is the value of the {x_axis_topic}, which legend has a {y_axis_topic} of {y_value}?
• For an {x_axis_topic} of {x_value}, which legend displays a {y_axis_topic} of {y_value}?
• Which legend has a value of {y_value} when the {x_axis_topic} is {x_value}?
• Which legend shows a value of {y_value} when the {x_axis_topic} equals {x_value}?
• When the {x_axis_topic} is {x_value}, which legend corresponds to a value of {y_value}?
• At {x_value} on the {x_axis_topic}, which legend has a value of {y_value}?
• What legend’s value is {y_value} when the {x_axis_topic} reads {x_value}?
• When {x_value} is the value of the {x_axis_topic}, which legend has a value of {y_value}?
• For an {x_axis_topic} of {x_value}, which legend displays a value of {y_value}?
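A question instance is produced by filling each bracketed placeholder (e.g., {x_axis_topic}, {y_axis_topic}, {x_value}, {legend_name}) with values taken from the underlying data of a chart, so each generated question is paired with a known ground-truth answer. The following Python sketch illustrates this substitution step; the metadata layout and the helper name instantiate_templates are illustrative assumptions, not the released EvoChart implementation.

```python
# Illustrative sketch of template instantiation.
# The metadata layout and helper names are assumptions for illustration,
# not the released EvoChart code.

TEMPLATES = [
    "What is the {y_axis_topic} of {x_value} {x_axis_topic} in {legend_name}?",
    "What {x_axis_topic} value corresponds to the highest {y_axis_topic} for {legend_name}?",
]

def instantiate_templates(chart_meta):
    """Fill template placeholders with values from one chart's metadata."""
    x_topic, y_topic = chart_meta["x_axis_topic"], chart_meta["y_axis_topic"]
    qa_pairs = []
    for series in chart_meta["series"]:                # one entry per legend label
        legend = series["legend_name"]
        # Value-lookup questions: one per underlying data point.
        for x_value, y_value in series["points"]:
            question = TEMPLATES[0].format(
                y_axis_topic=y_topic, x_value=x_value,
                x_axis_topic=x_topic, legend_name=legend)
            qa_pairs.append({"question": question, "answer": y_value})
        # Extremum question: the answer is the x position of the series' peak.
        peak_x, _ = max(series["points"], key=lambda p: p[1])
        question = TEMPLATES[1].format(
            x_axis_topic=x_topic, y_axis_topic=y_topic, legend_name=legend)
        qa_pairs.append({"question": question, "answer": peak_x})
    return qa_pairs

# Hypothetical chart metadata.
chart_meta = {
    "x_axis_topic": "year",
    "y_axis_topic": "revenue",
    "series": [{"legend_name": "Product A",
                "points": [(2020, 3.1), (2021, 4.5), (2022, 4.2)]}],
}
print(instantiate_templates(chart_meta))
```

Under this sketch, the answer is read directly from the same metadata that fills the template, so each synthesized pair has a deterministic ground truth.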
7.5 EvoChart-QA Source Websites
We collected the 650 chart images in EvoChart-QA from 140 publicly available websites, for academic research purposes only. All sources are listed below:
• https://www.beautiful.ai
• https://www.formsbirds.com
• https://leanscape.io
• https://www.investopedia.com
• https://www.storytellingwithdata.com
• https://blog.finxter.com
• https://www.degruyter.com
• https://www.anychart.com
• https://www.infragistics.com
• https://awesomeopensource.com
• https://fluttercore.com
• https://www.nicesnippets.com
• http://20bits.com
• https://unreasonablegroup.com
• https://mavink.com
• https://www.smartsheet.com
• https://template.wps.com
• https://learn.microsoft.com
• https://www.zoho.com
• https://keski.condesan-ecoandes.org
• https://www.statmethods.net
• https://www.theinformationlab.com
• https://www.pluralsight.com
• https://www.visualitics.it
• https://dribbble.com
• https://infogram.com
• https://beautifulai-od3.appspot.com
• https://www.slideteam.net
• https://sainsdata.id
• https://www.elegantthemes.com
• https://www.polymersearch.com
• https://blog.csdn.net
• https://aten.edu.vn
• https://www.sakuranpost.net
• https://imagesee.biz
• https://search.justgulfwon.live
• https://p.codekk.com
• https://vitalflux.com
• https://zebrabi.com
• https://classfullprecisions.z13.web.core.windows.net
• https://www.infocaptor.com
• https://www.monkeybreadsoftware.de
• https://mdpi.com
• https://www.calxa.com
• https://byggipedia.se
• https://www.template.net
• https://www.devtodev.com
• https://www.bakertilly.com
• https://www.researchgate.net
• https://www.tessresearch.org
• https://www.tandfonline.com
• http://ww25.chartexamples.com
• https://www.exceldemy.com
• https://in.pinterest.com
• https://blog.51cto.com
• https://www.fusioncharts.com
• https://inforiver.com
• https://exceljet.net
• https://x.com
• https://stevenrattner.com
• https://www.tillerhq.com
• https://knifeknowitall.com
• https://exploratory.io
• https://www.r-bloggers.com
• https://jethrojeff.com/
• https://marcuscalan.blogspot.com
• https://www.pewresearch.org/
• https://www.smashingmagazine.com
• https://chart-studio.plotly.com
• https://www.mdpi.com
• https://loganix.com
• https://www.knowbe4.com
• https://www.zdnet.com
• https://goldhartmediation.ca
• https://wiseinvestments.ca
• https://laptrinhx.com
• https://mungfali.com
• https://data-flair.training
• https://www.ncl.ac.uk
• https://www.dummies.com
• https://georgecarlo.blogspot.com
• http://www.aploris.com
• https://respect.international
• https://www.pinterest.com
• https://www.educba.com
• https://www.statista.com/
• https://slidebazaar.com
• https://venngage.com
• https://airfreesm.best
• https://ar.pinterest.com
• https://www.conceptdraw.com
• https://www.ft.com
• https://www.statology.org
• https://exceljet.net/charts
• https://www.slidekit.com
• https://worksheetsploshes.z14.web.core.windows.net
• https://www.hotzxgirl.com
• https://www.genekitr.fun
• https://medium.com
• https://daydreamingnumbers.com
• https://www.newsweek.com
• https://www.mekkographics.com
• https://nowbam.com
• https://www.mercurynews.com
• https://www.everviz.com
• https://byjus.com
• https://www.dailyrecord.co.uk
• https://appfire.com
• https://www.aiophotoz.com
• https://socialbarrel.com
• https://www.canadianpizzamag.com
• https://edubenchmark.com
• https://ieltsessentialindia.blogspot.com
• https://forum.knime.com
• https://docs.oracle.com
• https://www.cec.health.nsw.gov.au
• https://www.linkedin.com
• https://online.stat.psu.edu
• https://gmt-tutorials.org
• https://gitcode.csdn.net
• https://online.visual-paradigm.com
• https://python-charts.com
• https://cloud.tencent.com
• https://www.cnblogs.com
• https://help.xlstat.com
• https://www.listendata.com
• https://docs.thoughtspot.com
• https://www.data-to-viz.com
• https://fahimahmad.netlify.app
• https://developer.aliyun.com
• https://visme.co
• https://pythonspot.com
• https://data36.com
• https://bootcamp.uxdesign.cc
• https://revistaplural.es
• https://environicsanalytics.com
• https://mainpackage9.gitlab.io
• https://en.wikipedia.org
• https://riset.guru.pubiway.com
• https://lopezcollege.weebly.com
References
- [1] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, and Ahmed Awadallah et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
- [2] Saleem Ahmed, Bhavin Jawade, Shubham Pandey, Srirangaraj Setlur, and Venu Govindaraju. Realcqa: Scientific chart question answering as a test-bed for first-order logic. In ICDAR, volume 14189 of LNCS, pages 66–83, 2023.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
- [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- [5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- [6] Zhi-Qi Cheng, Qi Dai, and Alexander G. Hauptmann. Chartreader: A unified framework for chart derendering and comprehension without heuristic rules. In ICCV, pages 22145–22156, 2023.
- [7] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
- [8] Çaglar Gülçehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- [9] Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal LLM for chart understanding and generation. arXiv preprint arXiv:2311.16483, 2023.
- [10] Kushal Kafle, Brian L. Price, Scott Cohen, and Christopher Kanan. DVQA: understanding data visualizations via question answering. In CVPR, pages 5648–5656, 2018.
- [11] Kushal Kafle, Robik Shrestha, Brian L. Price, Scott Cohen, and Christopher Kanan. Answering questions about data visualizations using efficient bimodal fusion. In WACV, pages 1487–1496, 2020.
- [12] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. In ICLR, 2018.
- [13] Deqing Li, Honghui Mei, Yi Shen, Shuang Su, Wenli Zhang, Junting Wang, Ming Zu, and Wei Chen. Echarts: A declarative framework for rapid construction of web-based visualization. Visual Informatics, 2(2): 136–146, 2018.
- [14] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
- [15] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, and Yu Qiao. SPHINX: the joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
- [16] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. Matcha: Enhancing visual language pretraining with math reasoning and chart derendering. In ACL, pages 12756–12770, 2023.
- [17] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024.
- [18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
- [19] Ahmed Masry, Parsa Kavehzadeh, Do Xuan Long, Enamul Hoque, and Shafiq Joty. Unichart: A universal vision-language pretrained model for chart comprehension and reasoning. In EMNLP, pages 14662–14684, 2023.
- [20] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq R. Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, pages 2263–2279, 2022.
- [21] Ahmed Masry, Mehrad Shahmohammadi, Md. Rizwan Parvez, Enamul Hoque, and Shafiq Joty. Chartinstruct: Instruction tuning for chart comprehension and reasoning. arXiv preprint arXiv:2403.09028, 2024.
- [22] Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, and Shafiq Joty. Chartgemma: Visual instruction-tuning for chart reasoning in the wild, 2024.
- [23] Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384, 2024.
- [24] Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots. In WACV, pages 1516–1525, 2020.
- [25] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, and Florencia Leoni Aleman et al. Gpt-4 technical report, 2024.
- [26] Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, and William Yang Wang. Visual chain of thought: Bridging logical gaps with multimodal infillings, 2024.
- [27] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, and Soroosh Mariooryad et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- [28] Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, and Yi Zhang. Bootstrapping llm-based task-oriented dialogue agents via self-talk, 2024.
- [29] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023.
- [30] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. arXiv preprint arXiv:2406.18521, 2024.
- [31] Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, and Yu Qiao. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185, 2024.
- [32] Fangzhi Xu, Qiushi Sun, Kanzhi Cheng, Jun Liu, Yu Qiao, and Zhiyong Wu. Interactive evolution: A neural-symbolic self-training framework for large language models. arXiv preprint arXiv:2406.11736, 2024.
- [33] Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, and Jun Liu. Symbol-llm: Towards foundational symbol-centric interface for large language models. arXiv preprint arXiv:2311.09278, 2023.
- [34] Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woo-Myoung Park. Gpt3mix: Leveraging large-scale language models for text augmentation. In Findings of EMNLP, pages 2225–2239, 2021.
- [35] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. Tinychart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv preprint arXiv:2404.16635, 2024.