ViUniT: Visual Unit Tests for More Robust Visual Programming
Abstract
Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.
1 Introduction


Visual Programming [14, 49], which involves generating executable programs that leverage state-of-the-art specialist systems (e.g. object detection, captioning, etc.), has emerged as an effective method for tackling compositional reasoning tasks [48, 15]. Correct visual programs must often be inferred without training programs, because program annotations are expensive to obtain. Recently, some methods improve the performance of visual program synthesis by leveraging programs that yield correct results on training data [23, 29]. While these approaches have shown improvements, a critical limitation persists: visual programs can be right for the wrong reasons. For example, a human evaluation of 100 visual programs that produced correct responses, generated by CodeLlama-7B [41] (a leading open-source large language model, LLM) for questions in GQA [20], showed that 33% of them were not actually correct, and 70% of the incorrect programs (23% of the total) would require significant rewriting to be correct.
To mitigate this prevailing issue, we propose Visual Unit Testing (ViUniT), a framework for automatically generating unit tests for visual programs.
While automatic unit test generation has gained momentum in text-based tasks [5, 46, 2, 12, 50], its application to visual program synthesis has been limited.
Recent efforts toward visual unit tests focused primarily on checking program return value types (e.g. the output falling outside a range of options, like yes or no) [25].
However, this approach does not assess the program’s execution or logical correctness, limiting the types of errors it can address.
In this work, we bridge this gap by addressing challenges that have hindered unit test use in visual question answering (VQA) and image-text-matching (ITM).
As seen in Figure 1, visual programming converts queries to code that executes on test images to provide a response. For such programs, unit tests take the form of images and expected answers. Unit tests are difficult to construct because they need to have sufficient coverage to diagnose errors. To solve this problem, we leverage language models to generate candidate sets of descriptions of images that could test the code (Section 3.2.1). We formulate an optimization criterion to select ones that maximize coverage of possible program inputs and outputs (Section 3.2.2), and convert selected descriptions to images (Section 3.2.3). Our approach is entirely unsupervised with no accompanying annotations.
Unit tests can be used to identify incorrect programs, but integrating this signal to improve model behavior is challenging. In Section 3.4 we explore several mechanisms, summarized in Figure 2, including:
- Best Program Selection: We select the candidate program that scores highest on the unit test suite, improving accuracy by an average of 11.4 points over the base setup (Table 1).
- Re-prompting: We use unit test outputs to guide the generation of improved programs when initial programs perform poorly on the unit test suite. Relative to regeneration without unit tests, programs are over 3% more accurate (Table 3).
- Unsupervised Reinforcement Learning (RL) Reward Design: We use unit test scores as feedback to fine-tune an LLM on programs more likely correct for the right reasons, surpassing supervised correctness-based rewards by an average of 1.3 points across tasks (Table 4).
- Answer Refusal: Unit test scores are used to assess program confidence, reverting to an end-to-end model if the program is not robust, achieving up to 0.8 F1 score in correctly refusing programs that would fail (Figure 9).
To summarize our contributions, we present ViUniT, the first framework to introduce unit tests that verify the logical correctness of visual programs.
We conduct a broad exploration of unit test generation configurations (Section 5), showing that maximizing coverage is an important criterion.
We introduce four ways to leverage unit-tests to improve models (Section 3.4): best program selection, answer refusal, re-prompting, and unsupervised reward design for reinforcement learning.
Overall, integrating unit tests improves frozen-LLM accuracy by 11.4% and enables 7B open-source LLMs to outperform proprietary models like gpt-4o-mini by an average of 7.7 points, while improving underlying code correctness.
Broader adoption of unit-test suites will significantly enhance the robustness and trustworthiness of visual programming approaches.
2 Related Work
Visual Program Synthesis: The recent advancements in LLMs [1, 33, 53, 54, 4, 3, 21, 36] have led to their use as a planning interface for the modularization of tools to execute complex reasoning tasks involving multiple modalities [34, 42, 47, 9, 29] and as a reasoning module for visual agents [58, 59, 57, 16]. Specialized coding LLMs [41, 27, 51, 13] have demonstrated significant potential in addressing visual challenges by generating executable code based on contextual demonstrations [49, 14, 15, 10, 55], with performance comparable to or better than vision-language models [28, 37, 31, 6]. Attempts to improve the initial paradigm involve automatically generating a pool of effective programs to retrieve as in-context examples [48] and tuning a model through reinforcement learning by sampling programs that succeed on the training set [23]. More relevant to this work, Hu et al. [19] distill program reasoning into a VLM as chain-of-thought reasoning by generating multiple programs per query and selecting the best one, either by using the ground truth answer as a proxy for correctness or by having it evaluated by an LLM. However, a critical issue remains: some generated programs achieve correct outcomes without sound reasoning, which we address in this paper.
LLM Unit Test Generation: Unit tests have been used as a reinforcement learning signal to train code-generating LLMs [27, 5, 44, 12, 45, 7]. Existing methods for automatic unit test generation with LLMs [5, 2, 12, 50] focus primarily on text-based tasks, generating entire unit test scripts. However, these approaches often result in issues like compilation errors, low coverage, redundant assertions, and empty tests [46]. Recent work [25] proposes property testing on the outputs of visual programs by leveraging LLMs to generate properties that should be satisfied by the output given the query (e.g. the output should be a color if the query asks for one). Yet, this method inherits many limitations of LLM-generated script-based unit testing, and crucially, it fails to assess logical correctness—meaning it overlooks cases where program outputs may be right for the wrong reasons. Instead, we propose a method of generating unit tests to verify the execution of visual programs without requiring an LLM to directly generate unit-test scripts, avoiding the issues that tend to accompany the automatic generation of unit tests using LLMs. In particular, we use LLMs to generate image descriptions and expected answers without requiring any direct code generation. Image descriptions and expected answers are then transformed into a unit test using a text-to-image diffusion model [40].

3 Method
In this section, we formalize the tasks of visual program synthesis and unit test generation (Section 3.1) and introduce our framework (Section 3.2).
Our method comprises two main components: unsupervised generation of visual unit tests (Section 3.2) and unit test scoring (Section 3.3). We propose four ways to leverage unit tests in Section 3.4: Best Program Selection, Answer Refusal, Re-Prompting, and Unsupervised RL Reward Design.
3.1 Task Definition
Visual Program Synthesis: Given a visual input $x$ and a textual query $q$ about $x$, our goal is to synthesize a program $z$ that correctly answers $q$ about $x$. Each program $z$ is executed on the visual input $x$ using an execution engine $\phi$, yielding a predicted answer $\hat{y} = \phi(z, x)$. Our objective is to select the program $z^*$ that is most likely to produce the correct answer $y$ to the query $q$ about $x$, formalized as:
$$z^* = \arg\max_{z} \Pr\big[\phi(z, x) = y\big] \quad (1)$$
Visual Unit Testing: To assess the candidate programs $\mathcal{Z} = \{z_1, \dots, z_K\}$, we employ a unit test generator $\mathcal{G}$, which generates a set of unit tests $T = \{(v_j, a_j)\}_{j=1}^{M}$. Each unit test consists of a test visual input $v_j$ and the corresponding correct answer $a_j$ to the query $q$ on that input. For each candidate program $z_i$, we execute it on all test inputs to obtain outputs $\hat{a}_{ij} = \phi(z_i, v_j)$.
3.2 Unsupervised Visual Unit Test Generation


Given a program $z$ that aims to solve a query $q$, our goal is to generate a set of unit tests $T$ comprising input images and expected answers, as shown in Figure 3. This process involves three main steps: Candidate Unit Test Generation (Section 3.2.1), Unit Test Sampling (Section 3.2.2), and Image Generation (Section 3.2.3).
3.2.1 Candidate Unit Test Generation
As illustrated in Figure 1, rather than generating images directly for unit tests, we first create image descriptions with expected answers. This approach reduces computational overhead during the preliminary stage of unit test coverage sampling, after which we generate images only for those tests that are included in the final unit test suite $T$. In particular, we first generate a superset of candidate unit tests $\tilde{T} = \{(c_j, a_j)\}_{j=1}^{N}$ using the unit test generator $\mathcal{G}$, which is implemented as an auto-regressive large language model. The unit test generator can take both the query and the program implementation as inputs, i.e., $\tilde{T} = \mathcal{G}(q, z)$. Each candidate unit test consists of an image caption $c_j$ and an expected answer $a_j$. We explore whether including the program implementation provides useful signals for unit test generation (Section 5), despite conventional engineering practices that advocate for implementation-independent unit tests. This allows us to investigate whether this principle extends to visual unit testing.
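A minimal Python sketch of this step is shown below. The `llm_generate` callable, the "description => answer" output format, and its parsing are illustrative assumptions rather than the paper's exact prompt; only the system prompt string and the filtering of answers longer than five words follow the settings reported in the supplement (Appendix D).

```python
# Sketch of candidate unit test generation (Section 3.2.1).
# `llm_generate(prompt, num_return_sequences)` stands in for any LLM call
# (e.g. a model served via VLLM); the "description => answer" line format
# and its parsing are illustrative assumptions.
def generate_candidate_tests(query, program=None, llm_generate=None, n_sequences=3):
    """Return deduplicated (image_description, expected_answer) candidate pairs."""
    prompt = (
        "You are a skilled AI assistant specialized in generating test cases "
        "for programs that respond to queries about images.\n"
        f"Query: {query}\n"
    )
    if program is not None:  # Query+Implementation variant (Section 3.2.1)
        prompt += f"Program under test:\n{program}\n"
    prompt += "List image descriptions and expected answers, one per line as 'description => answer'."

    candidates = []
    for output in llm_generate(prompt, num_return_sequences=n_sequences):
        for line in output.splitlines():
            if "=>" not in line:
                continue
            caption, answer = (part.strip() for part in line.split("=>", 1))
            # Answers longer than five words are filtered as out of distribution.
            if caption and answer and len(answer.split()) <= 5:
                candidates.append((caption, answer))
    return list(dict.fromkeys(candidates))  # deduplicate while preserving order
```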
3.2.2 Unit Test Coverage Sampling
Unit tests verify the behavior of code and should exhibit high isolation and coverage [24]. In the context of visual programs, isolation is trivial since each program is a self-contained function. However, achieving high coverage—ensuring that the tests collectively exercise as much of the codebase as possible—is non-trivial due to the computational overhead of executing all candidate tests. To address this, we define coverage metrics tailored for visual programming unit tests, focusing on maximizing the diversity of both expected answers and visual inputs. The coverage sampler subsamples $M$ pairs from $\tilde{T}$, forming the subset $T_s$.
Coverage by Answer: We aim to include tests that cover all possible expected answers present in the candidate set. Let $A$ be the set of all expected answers in $\tilde{T}$. We define the answer diversity criterion as ensuring that for every possible answer $a \in A$, there is at least one selected test $(c_j, a_j)$ such that $a_j = a$:
$$\forall a \in A \;\; \exists (c_j, a_j) \in T_s : a_j = a \quad (2)$$
Coverage by Input: To maximize the diversity of visual inputs without generating all possible images, we operate on the image captions. We define an encoding function $E(\cdot)$ that maps a caption to a feature vector. We aim to maximize the input diversity score $D(T_s)$, defined as the maximum pairwise distance between the encoded captions:
$$D(T_s) = \max_{(c_i, \cdot), (c_j, \cdot) \in T_s,\, i \neq j} \big\lVert E(c_i) - E(c_j) \big\rVert \quad (3)$$
This encourages the selection of tests with diverse descriptions, which in turn is likely to yield diverse images.
Coverage by Answer then Input: We begin by selecting one test for each possible answer to satisfy the answer diversity criterion (Equation 2). Then, we iteratively select additional tests that maximize $D$ using the following criterion until $M$ tests are selected, forming the subset $T_s$:
$$t^* = \arg\max_{t \in \tilde{T} \setminus T_s} D\big(T_s \cup \{t\}\big) \quad (4)$$
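For concreteness, the following sketch approximates the "Coverage by Answer then Input" sampler using sentence-transformers embeddings (the paper's caption encoder is all-MiniLM-L6-v2); the greedy max-min selection step is a standard farthest-point approximation of Equation 4 and of Algorithm 1 in the supplement, not their exact implementation.

```python
# Approximate "Coverage by Answer then Input" sampling (Eqs. 2-4).
# The greedy step adds, at each iteration, the candidate caption that is
# farthest in embedding space from the captions already selected.
import numpy as np
from sentence_transformers import SentenceTransformer

def sample_unit_tests(candidates, M=5, model_name="all-MiniLM-L6-v2"):
    """candidates: list of (caption, expected_answer) pairs; returns M of them."""
    encoder = SentenceTransformer(model_name)
    embs = encoder.encode([caption for caption, _ in candidates])  # (N, d)

    selected = []
    # Coverage by answer (Eq. 2): one test per distinct expected answer.
    for answer in dict.fromkeys(ans for _, ans in candidates):
        selected.append(next(i for i, (_, a) in enumerate(candidates) if a == answer))

    # Coverage by input (Eqs. 3-4): greedily add the most distant remaining caption.
    while len(selected) < min(M, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in selected]
        dists = [min(np.linalg.norm(embs[i] - embs[j]) for j in selected)
                 for i in remaining]
        selected.append(remaining[int(np.argmax(dists))])

    return [candidates[i] for i in selected[:M]]
```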
3.2.3 Image Generation
For each selected unit test $(c_j, a_j) \in T_s$, we generate the corresponding image $v_j$ using a text-to-image model $\mathcal{M}$, yielding the final unit-test suite $T = \{(v_j, a_j)\}_{j=1}^{M}$ with $v_j = \mathcal{M}(c_j)$. We employ three state-of-the-art diffusion models: SDv1.4 [40], SDXL3 [38], and LM Guided Diffusion [30], which utilizes automatically generated templates with phrases and bounding boxes for spatial conditioning [30]. To provide these additional signals, we prompt an LLM with in-context examples and the caption to generate pairs of phrases and bounding boxes to feed into the text-to-image model.
3.3 Program Selection Based on Unit Test Scores
We select the program that succeeds on the most unit tests via Equation 6, where the overall score $S(z_i)$ is computed by an aggregator $\sigma$ over individual unit test scores $s(z_i, t_j)$.
Individual Unit Test Scorer $s$: For each program $z_i$ and test $t_j = (v_j, a_j)$, we execute $z_i$ on $v_j$ to obtain the predicted answer $\hat{a}_{ij} = \phi(z_i, v_j)$. We define a scoring function $s$ that assigns a score based on the program's output:
$$s(z_i, t_j) = \begin{cases} \gamma_c & \text{if } z_i \text{ fails to compile} \\ \gamma_r & \text{if } z_i \text{ raises a runtime error on } v_j \\ \mathbb{1}\big[\hat{a}_{ij} = a_j\big] & \text{otherwise} \end{cases} \quad (5)$$
where $\gamma_r$ and $\gamma_c$ are runtime and compilation error penalties and $\mathbb{1}[\cdot]$ is the indicator function.
Score Aggregator $\sigma$: The individual scores are aggregated to compute an overall score $S(z_i) = \sigma\big(\{s(z_i, t_j)\}_{j=1}^{M}\big)$. Here, $\sigma$ represents the averaging function. The program with the highest score is selected as the best candidate, approximating Equation 1 by:
$$z^* = \arg\max_{z_i \in \mathcal{Z}} S(z_i) \quad (6)$$
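The scoring and selection logic can be summarized in a short sketch. The execution engine `run_program`, the penalty values, and the answer normalization are assumptions rather than the paper's exact settings.

```python
# Sketch of unit test scoring (Eq. 5) and best-program selection (Eq. 6).
# `run_program(program, image)` plays the role of the execution engine phi;
# the penalty values below are illustrative placeholders.
GAMMA_COMPILATION = -0.5  # gamma_c, compilation-error penalty (assumed value)
GAMMA_RUNTIME = -0.25     # gamma_r, runtime-error penalty (assumed value)

def score_program(program, unit_tests, run_program):
    """unit_tests: list of (test_image, expected_answer); returns the mean score S(z)."""
    scores = []
    for image, expected in unit_tests:
        try:
            predicted = run_program(program, image)
        except SyntaxError:
            scores.append(GAMMA_COMPILATION)
        except Exception:
            scores.append(GAMMA_RUNTIME)
        else:
            scores.append(float(str(predicted).strip().lower() == expected.strip().lower()))
    return sum(scores) / len(scores)  # mean aggregator sigma

def select_best_program(programs, unit_tests, run_program):
    """Return (S(z*), z*), the highest-scoring candidate program (Eq. 6)."""
    return max(((score_program(z, unit_tests, run_program), z) for z in programs),
               key=lambda pair: pair[0])
```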
3.4 Visual Unit Test Utilization Methods
Figure 2 illustrates how to leverage visual unit tests in four ways, further elaborated below:
Best Program Selection:
Given a set of candidate programs $\mathcal{Z}$ for a query $q$, our goal is to select the program that is most likely to produce the correct answer when executed on the visual input $x$. We utilize the unit test scores $S(z_i)$ computed for each program as described in Section 3.3. The best program, i.e., the program that succeeds on the most unit tests, is selected by solving the optimization problem in Equation 6.
Answer Refusal: If the maximum unit test score falls below a threshold $\tau$, indicating low confidence in all candidate programs, we refuse to provide a programmatic answer and instead fall back to an end-to-end model (refer to the supplement for details). Formally, the decision rule is: refuse if $\max_{z_i \in \mathcal{Z}} S(z_i) < \tau$.
Otherwise, we proceed to execute the selected program $z^*$ on the original visual input $x$ to obtain the final answer $\hat{y} = \phi(z^*, x)$.
The hyperparameter $\tau$ balances a trade-off between attempting to answer with potentially incorrect programs and deferring to a more reliable but less interpretable method.
Re-Prompting: If all generated programs fail to meet the threshold (i.e., $\max_{z_i \in \mathcal{Z}} S(z_i) < \tau$), we employ a re-prompting strategy to generate better candidate programs using feedback from unit tests:
$$\mathcal{Z}' = \pi\big(\rho, F\big) \quad (7)$$
where $\rho$ is an adaptation of the original input containing the API, the query $q$, and in-context examples of unit-test-feedback corrections; $F$ is the feedback derived from unit test results (comprising unit test image descriptions, expected answers, and the predicted answers generated by the program in the current iteration), summarizing the discrepancies between expected and actual outputs; and $\pi$ is the program generator.
We select the best program $z^*$ from the new set $\mathcal{Z}'$ based on their unit test scores $S(z)$.
If $S(z^*) \geq \tau$, we execute $z^*$ on the original visual input $x$. Otherwise, we may repeat the re-prompting process until a predefined number of iterations is reached.
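One possible way to combine answer refusal and re-prompting is sketched below. The threshold value, the fallback model, and the `regenerate_programs` and `score_program` helpers (the latter as in the scoring sketch above) are assumptions rather than the paper's exact implementation.

```python
# Sketch combining answer refusal and unit-test re-prompting (Section 3.4).
# `regenerate_programs(query, feedback)` is an LLM call conditioned on
# unit-test feedback (Eq. 7) and `fallback_vqa` is an end-to-end model;
# all names and the default threshold are illustrative assumptions.
def answer_query(image, query, programs, unit_tests, run_program, score_program,
                 regenerate_programs, fallback_vqa, tau=0.8, max_iterations=1):
    for iteration in range(max_iterations + 1):
        scored = [(score_program(z, unit_tests, run_program), z) for z in programs]
        best_score, best_program = max(scored, key=lambda pair: pair[0])
        if best_score >= tau:                      # confident: run on the real input
            return run_program(best_program, image)
        if iteration < max_iterations:             # re-prompt with unit-test feedback
            feedback = build_feedback(best_program, unit_tests, run_program)
            programs = regenerate_programs(query, feedback)
    return fallback_vqa(image, query)              # refuse the programmatic answer

def build_feedback(program, unit_tests, run_program):
    """Summarize expected vs. predicted answers on each unit test (the feedback F)."""
    lines = []
    for image, expected in unit_tests:
        try:
            predicted = run_program(program, image)
        except Exception as err:
            predicted = f"<error: {err}>"
        lines.append(f"expected: {expected} | predicted: {predicted}")
    return "\n".join(lines)
```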
Unsupervised Reinforcement Learning Reward Design: We propose to design RL rewards based on visual unit tests, aiming not only to provide extra supervision but also to curtail policy deterioration due to logically incorrect programs [23]. The goal is to optimize a policy $\pi_\theta$, implemented as an autoregressive language model for program generation and parameterized by $\theta$, by minimizing the reward-weighted loss over the dataset $\mathcal{D}$, where each example consists of an image $x$, user query $q$, program $z$ generated by the previous iteration's policy, and ground truth answer $y$:
$$\mathcal{L}(\theta) = \mathbb{E}_{(x, q, z, y) \sim \mathcal{D}} \big[ R \cdot \ell(z, q; \theta) \big], \qquad \ell(z, q; \theta) = -\tfrac{1}{L} \sum_{l=1}^{L} \log \pi_\theta\big(z_l \mid z_{<l}, q\big) \quad (8)$$
where $\ell$ is the negative log-likelihood loss on next-token prediction and $L$ is the sequence length.
Khan et al. [23] introduce a correctness reward based on performance on the training set:
$$R_{\text{corr}}(z, x, y) = \mathbb{1}\big[\phi(z, x) = y\big] \quad (9)$$
However, this approach can lead to sparse rewards and may falsely reward programs that are right for incorrect reasons. Khan et al. [23] address this issue through human corrections to stabilize training. Instead, we reformulate the reward using feedback from the visual unit tests:
$$R_{\text{ViUniT}}(z) = \mathbb{1}\big[S(z) \geq \eta\big] \quad (10)$$
where $\eta$ is a passing threshold. We terminate policy iteration on declining reward. Following earlier work [22], we assume that an optimal policy will keep increasing an optimal reward function $R^*$. Thus, when our proxy reward declines (i.e., regret increases), there are theoretical guarantees that we are not far from the optimal policy that can be learned under $R^*$.
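A minimal sketch of the resulting update for a HuggingFace-style causal LM is given below. The binary reward form, the threshold name `eta`, and the omission of query-token masking in the loss are simplifying assumptions.

```python
# Sketch of the reward-weighted policy update (Eqs. 8 and 10).
# `model` and `tokenizer` are a HuggingFace-style causal LM (e.g. a CodeLlama
# checkpoint with LoRA adapters); `unit_test_score` is S(z) from Section 3.3.
# For brevity the NLL is computed over the full (query + program) sequence
# rather than masking the query tokens.
def reward_weighted_loss(model, tokenizer, query, program, unit_test_score, eta=0.8):
    reward = 1.0 if unit_test_score >= eta else 0.0          # Eq. 10 (unsupervised)
    enc = tokenizer(query + program, return_tensors="pt").to(model.device)
    nll = model(**enc, labels=enc["input_ids"]).loss          # mean next-token NLL
    return reward * nll                                       # Eq. 8 (per example)

# Training step (programs z are sampled from the previous iteration's policy):
#   loss = reward_weighted_loss(model, tok, q, z, S_z); loss.backward(); optimizer.step()
```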
4 Experimental Setup
Below is the experimental setup: datasets (Section 4.1), baselines (Section 4.2), and implementation details (Section 4.3).
4.1 Data
We utilize three compositional reasoning datasets: GQA [20] for Visual Question Answering (VQA), and SugarCREPE [17] and Winoground [52] for Image-Text Matching (ITM), assessing model performance via accuracy metrics. For GQA, we calculate accuracy using an implementation by Surís et al. [49], which standardizes and compares generated answers for exact matches (https://github.com/cvlab-columbia/viper/blob/main/datasets/gqa.py). Our experimental setup incorporates training and testing splits sampled similarly to Khan et al. [23], specifically testing on 502 examples from the GQA balanced-val split and training on 1022 examples from the balanced-train split, with 10 samples per question group. In SugarCREPE, we utilize 788 examples for training by subsampling approximately 10% of the dataset balanced across question types, excluding our validation split. The validation subset consists of 560 examples and includes both positive and negative image-text pairings from 40 samples from each of the 7 question types. The full Winoground dataset is used, encompassing all possible positive and negative pairings for a total of 1600 test examples, with the SugarCREPE dataset employed for training purposes. Refer to the supplement for further dataset details.


4.2 Baselines
We evaluate against the following baselines:
Base Setup: Following the prototypical use of visual programs [49, 14], we prompt the LLM to generate a single program per query, which is executed to retrieve a response.
Most Common Answer: To leverage multiple programs, we compare performance with selecting the most common answer across executed programs if one exists.
Error Re-prompting: To evaluate the effectiveness of unit-test incorporation in program correction via unit-test re-prompting, we benchmark performance against a method that leverages error-traces as feedback in Equation 7. Further details are provided in the supplement.
Correctness Reward: We benchmark our unsupervised unit-test RL reward formulation against the supervised correctness reward described by Equation 9.
4.3 Implementation Details
We provide a summary of key implementation details, with additional information in the supplement. Experiments were conducted on two A100 40GB GPUs, though a single GPU suffices for smaller API models. Results report the mean and standard deviation across 3 runs.
Program Generation Models: Three program generator models are employed: codellama/CodeLlama-7b-Python-hf [41] and google/codegemma-7b-it [51], hosted on HuggingFace and served by VLLM [26], as well as gpt-4o-mini [1] served by OpenAI. We use HuggingFace's SFT-Trainer to train the RL policy using LoRA [18] with the reward defined in Equation 10. Models are prompted with an API adapted from ViperGPT [49] and 4 in-context examples.
API Models: Object detection is performed using IDEA-Research/grounding-dino-base [32]. For image-text matching, we use openai/clip-vit-large-patch14-336 [39], and for visual question answering, we employ Salesforce/blip2-flan-t5-xxl [28]. All models are accessed through HuggingFace.
Unit Test Generation Models: We use meta-llama/Meta-Llama-3-8B-Instruct [8] to generate image descriptions and expected answers for unit test candidates. The unit test sampler is implemented with sentence-transformers, using the all-MiniLM-L6-v2 [56] model to embed image descriptions. For image generation, we use the diffusers library, specifically CompVis/stable-diffusion-v1-4 for SDv1.4, longlian/lmd_plus for LM Guided Diffusion, and stabilityai/stable-diffusion-xl-base-1.0 for SDXL3.
Program Scoring and Execution: Program executions are capped at 120 seconds. Unit test scoring error penalties are applied as defined in Equation 5. Unless specified otherwise, no end-to-end model fallback was employed on exception.
5 Strategies for Visual Unit Test Generation
We explore different unit test generation configurations applied to best program selection, using a smaller dataset of three questions from each group in GQA and each tag in WinoGround, yielding 303 and 504 samples, respectively.

Number of unit tests $M$. Figure 5 illustrates that increasing both the number of unit tests and the number of candidate programs improves accuracy on both datasets. Accuracy rises substantially with the addition of unit tests, particularly from 1 to 5 tests, after which gains diminish. Higher numbers of programs (e.g., 4 or 5) consistently yield better accuracy compared to fewer programs, underscoring the benefit of exploring multiple candidate solutions.

Unit Test Generator $\mathcal{G}$. Figure 6 demonstrates that in low unit test settings, incorporating program information into unit test generation yields comparable results to query-only approaches. However, as the number of unit tests and programs increases, disregarding implementation details proves significantly more effective. This aligns with software engineering best practices, where unit tests are designed to remain independent of specific implementations.

Unit Test Sampler. Figure 7 demonstrates the impact of different unit test sampling methods on model accuracy. In GQA, “Coverage By Answer then Input” shows increasing performance as the number of unit tests grows, thus allowing the saturation of possible answers. Figure 4(a) highlights limitations of the other methods: “Coverage by Input” may suffer from reduced answer diversity, and “Coverage by Answer” could involve repetitive inputs. In WinoGround there is negligible difference across methods, due to its restriction to two answers, preventing significant sampling diversity. Nevertheless, an analysis of performance by question type in the supplement shows that this sampling method yields higher results for attribute-related queries in both datasets.

Image Generator $\mathcal{M}$. Figure 8 illustrates the impact of different diffusion models. In GQA at lower unit test settings, LM Guided diffusion yields some accuracy improvements, while for WinoGround, LM Guided diffusion only helps in lower program settings, with quick convergence as the number of programs increases. The benefit of LM Guided diffusion is primarily driven by improved tests when spatial positioning is critical, as shown by the result breakdowns in the supplement and illustrated in Figure 4(b).

Scoring function $s$. The supplement presents results with varying error penalties, illustrating that in few unit test settings, imposing error penalties enhances the likelihood of selecting a successful program.
Table 1: Best program selection results (accuracy %, mean±std over 3 runs).

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|
| Base Setup | | | | | | |
| gpt-4o-mini | 1 | 0 | 42.03±1.21 | 44.98±0.75 | 38.75±0.47 | 41.92±0.81 |
| CodeLlama-7B | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
| CodeGemma-7B | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
| Most Common Answer Setup | | | | | | |
| CodeLlama-7B | 5 | 0 | 42.50±1.50 | 45.85±0.77 | 41.67±1.79 | 43.34±1.35 |
| CodeGemma-7B | 5 | 0 | 43.89±0.98 | 46.04±1.48 | 46.67±1.69 | 45.53±1.38 |
| ViUniT Setup (Ours) | | | | | | |
| CodeLlama-7B | 5 | 5 | 49.27±1.33 | 49.73±0.73 | 47.02±1.19 | 48.67±1.08 |
| CodeGemma-7B | 5 | 5 | 48.01±1.05 | 51.92±0.90 | 51.85±2.16 | 50.59±1.37 |
Table 2: Answer refusal results (accuracy %, mean±std over 3 runs).

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|
| Reverting on Error | | | | | | |
| CodeLlama-7B | 1 | 0 | 44.89±2.04 | 51.67±1.16 | 49.29±0.99 | 48.61±1.40 |
| CodeGemma-7B | 1 | 0 | 44.89±2.19 | 47.25±2.17 | 49.58±0.88 | 47.24±1.74 |
| Reverting on ViUniT Threshold (Ours) | | | | | | |
| CodeLlama-7B | 1 | 5 | 54.18±0.40 | 50.67±1.28 | 49.05±0.82 | 51.30±0.84 |
| CodeGemma-7B | 1 | 5 | 54.58±1.24 | 50.73±0.94 | 50.12±1.62 | 51.81±1.27 |
Table 3: Re-prompting results (accuracy %, mean±std over 3 runs).

| LLM | Iter. | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|---|
| Error Reprompting | | | | | | | |
| CodeLlama-7B | 1 | 1 | 0 | 37.92±2.68 | 42.46±0.57 | 33.21±0.64 | 37.86±1.30 |
| CodeGemma-7B | 1 | 1 | 0 | 42.63±2.42 | 42.42±1.91 | 44.52±1.05 | 42.63±2.42 |
| ViUniT Reprompting (Ours) | | | | | | | |
| CodeLlama-7B | 1 | 1 | 5 | 46.68±2.52 | 51.85±0.40 | 47.68±2.17 | 48.74±1.69 |
| CodeGemma-7B | 1 | 1 | 5 | 45.75±0.30 | 48.19±2.28 | 48.21±1.12 | 47.38±1.23 |
Table 4: Reinforcement learning reward design results (accuracy %, mean±std over 3 runs).

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|
| Supervised Correctness Reward | | | | | | |
| CodeLlama-7B | 1 | 0 | 39.18±4.88 | 48.65±0.87 | 39.58±2.75 | 42.47±2.83 |
| CodeGemma-7B | 1 | 0 | 43.03±5.08 | 45.98±2.64 | 46.31±2.26 | 45.11±3.33 |
| Unsupervised ViUniT Reward (Ours) | | | | | | |
| CodeLlama-7B | 1 | 0 | 40.57±2.10 | 46.52±0.81 | 41.85±1.44 | 42.98±1.45 |
| CodeGemma-7B | 1 | 0 | 45.68±2.45 | 49.29±0.43 | 46.55±0.69 | 47.17±1.19 |
6 Strategies of Visual Unit Test Utilization
Best Program Selection:
Table 1 underscores the efficacy of ViUniT-based selection in identifying the best program. Our approach demonstrates a notable average improvement of 11.4 accuracy points over the base setup and a substantial 7.7-point average gain over the gpt-4o-mini configuration. Furthermore, it surpasses most common answer selection by an average margin of 5.2 points.
Answer Refusal: Figure 9 illustrates the impact of varying the threshold on the F1 score of refusing programs with incorrect answers (left) and the false pass failure rate (right), measured relative to the total number of programs. The minimal false pass failure rate at higher thresholds supports the use of unit test scores as a proxy for correctness during unsupervised model fine-tuning. Table 2 shows an improvement of 3.6 points from reverting based on the unit-test threshold compared to reverting only on error. For CodeLlama-7B, performance on image-text matching is similar between the two methods, as some programs yield correct answers despite failing unit tests. Although such programs impact final performance, a human inspection of 40 samples revealed that 65% were unreliable from the start.
Re-prompting: Table 3 demonstrates that re-prompting with ViUniT achieves an average improvement of 7.5 points over error-based re-prompting, with a notable 10.9-point increase for CodeLlama-7B, which performs lower in the base setting. The unit tests offer additional opportunities for refining the method's initial response, as they go beyond error detection to assess program confidence, while also providing a measure of comparison between the programs.
RL Reward Design: The pattern of improvements is particularly interesting in the RL setting, where we find that ViUniT rewards outperform correctness rewards by an average of 1.3 points in accuracy despite not relying on the training labels. Additionally, we observe a notable reduction in the percentage of code leading to exceptions; errors decrease from 14.47% to 11.76% for CodeLlama and even more sharply from 11.73% to 4.68% for CodeGemma. These results indicate that heavily rewarding higher-quality code, as filtered through unit tests, encourages the development of a more robust and error-resistant policy.
7 Human Evaluation
We summarize key findings from two human evaluations that assess unit test quality and improvements in program reliability. Full details are available in the supplement.
Unit Test Evaluation: We randomly sampled 20 examples from each of the three datasets, each corresponding to 5 unit tests, resulting in a total of 300 unit tests, each of which was judged by three annotators. Based on the majority annotator response, 75% of unit tests per sample were correct.
Annotators could optionally comment on errors, with “Missing Object” noted as the most frequent issue.
Program Evaluation: To measure the effectiveness of unit tests in enhancing program reliability, we evaluated 100 VQA programs that correctly answered the queries, drawn from both the base and the unit-test best program selection setups. Two annotators with 3+ years of Python experience graded programs from 0 (Fully Correct) to 3 (Irrelevant).
Under the unit test setup, 86% of programs were fully correct, compared to 77% in the base setup. Additionally, only 5% of programs were marked completely incorrect—with none deemed irrelevant—compared to 14% and 4%, respectively, in the base setup. Notably, the most common error type shifted from “Incorrect Logic” in the base setup to “Missing Checks (e.g., list index out of range)” in the unit-test setup.
8 Conclusion and Future Work
We introduce ViUniT, the first framework to automatically generate unit tests for verifying visual program correctness, addressing cases where programs may appear correct for the wrong reasons. Unit tests are leveraged in four ways: best program selection (+11.4 points over the base setup and +7.7 points over gpt-4o-mini), answer refusal, re-prompting, and unsupervised RL reward design (+1.3 points over supervised rewards). Future directions include fine-grained test generation and broader task applications. By reinforcing logical correctness, ViUniT advances robustness and interpretability in visual programs.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Alagarsamy et al. [2024] Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3test: Assertion-augmented automated test case generation. Information and Software Technology, 176:107565, 2024.
- Anil et al. [2023] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. [2023] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023.
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Dou et al. [2024] Shihan Dou, Yan Liu, Haoxiang Jia, Enyu Zhou, Limao Xiong, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. StepCoder: Improving code generation with reinforcement learning from compiler feedback. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4571–4585, Bangkok, Thailand, 2024. Association for Computational Linguistics.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gao et al. [2024] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. Clova: A closed-loop visual assistant with tool usage and update. Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Ge et al. [2025] Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, and Trevor Darrell. Recursive visual programming. In European Conference on Computer Vision, pages 1–18. Springer, 2025.
- Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
- Guilherme and Vincenzi [2023] Vitor Guilherme and Auri Vincenzi. An initial investigation of chatgpt unit test generation capability. In Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, pages 15–24, 2023.
- Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
- Gupta and Kembhavi [2023] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
- [15] Cheng Han, James Chenhao Liang, Qifan Wang, MAJID RABBANI, Sohail Dianat, Raghuveer Rao, Ying Nian Wu, and Dongfang Liu. Image translation as diffusion visual programmers. In The Twelfth International Conference on Learning Representations.
- Hong et al. [2024] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
- Hsieh et al. [2024] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in neural information processing systems, 36, 2024.
- Hu et al. [2022] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- Hu et al. [2024] Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9590–9601, 2024.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
- Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- [22] Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart’s law in reinforcement learning. In The Twelfth International Conference on Learning Representations.
- Khan et al. [2024] Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, and Manmohan Chandraker. Self-training large language models for improved visual program synthesis with visual reinforcement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14344–14353, 2024.
- Khorikov [2020] Vladimir Khorikov. Unit Testing Principles, Practices, and Patterns. Simon and Schuster, 2020.
- Koo et al. [2024] Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, and Vicente Ordonez. PropTest: Automatic property testing for improved visual programming. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8241–8256, Miami, Florida, USA, 2024. Association for Computational Linguistics.
- Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- Le et al. [2022] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
- Li et al. [2024] Zhuowan Li, Bhavan Jasani, Peng Tang, and Shabnam Ghadar. Synthesize step-by-step: Tools templates and llms as data generators for reasoning-based chart vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13613–13623, 2024.
- Lian et al. [2024] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. Transactions on Machine Learning Research, 2024. Featured Certification.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024a.
- Liu et al. [2024b] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer, 2024b.
- Longpre et al. [2023] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pages 22631–22648. PMLR, 2023.
- Lu et al. [2024] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.
- Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
- Nijkamp et al. [2023] Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, et al. Xgen-7b technical report. arXiv preprint arXiv:2309.03450, 2023.
- Panagopoulou et al. [2024] Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
- Podell et al. [2024] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
- Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Schick et al. [2024] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.
- Selvaraju et al. [2020] Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020.
- Shen et al. [2023] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, et al. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936, 2023.
- [45] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. Execution-based code generation using deep reinforcement learning. Transactions on Machine Learning Research.
- Siddiq et al. [2023] Mohammed Latif Siddiq, Joanna Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, FA Rifat, and V Carvalho Lopes. Exploring the effectiveness of large language models in generating unit tests. arXiv preprint arXiv:2305.00418, 2023.
- Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
- [48] Aleksandar Stanić, Sergi Caelles, and Michael Tschannen. Towards truly zero-shot compositional visual reasoning with llms as programmers. Transactions on Machine Learning Research.
- Surís et al. [2023] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
- Takerngsaksiri et al. [2024] Wannita Takerngsaksiri, Rujikorn Charakorn, Chakkrit Tantithamthavorn, and Yuan-Fang Li. Tdd without tears: Towards test case generation from requirements through deep reinforcement learning. arXiv preprint arXiv:2401.07576, 2024.
- Team [2024] CodeGemma Team. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409, 2024.
- Thrush et al. [2022] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238. IEEE Computer Society, 2022.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Ukai et al. [2024] Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, and Nakamasa Inoue. Adacoder: Adaptive prompt compression for programmatic visual question answering. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9234–9243, 2024.
- Wang et al. [2020] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
- Wei et al. [2024] Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative llm-agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15077–15087, 2024.
- Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI conference on artificial intelligence, pages 3081–3089, 2022.
- Yang et al. [2023] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
Appendix A Data
The three compositional reasoning datasets used in this work are GQA [20], SugarCREPE [17], and WinoGround [52]. Table 5 shows examples from each dataset, and Table 6 summarizes the dataset statistics. For GQA evaluation, we sample 5 questions from each of the 102 question groups from the balanced-val split, for a total of 502 examples. For training, we sample 10 questions per group from the balanced-train split, yielding 1022 examples. Note that some groups such as typeVerifyC, stateChoose, and companyVerify do not have a sufficient number of questions, so we sample the whole group. For SugarCREPE, we utilize 788 examples for training by subsampling 10% of the dataset balanced across the 7 question types, excluding our validation split. This validation subset consists of 560 examples and includes both positive and negative image-text pairings from 40 samples from each of the 7 question types. The full Winoground dataset is used, encompassing all possible positive and negative pairings for a total of 1600 examples, with the SugarCREPE dataset employed for training.
Table 5: Example questions and answers from each dataset (images omitted).

| Dataset | Question | Answer |
|---|---|---|
| GQA | Are there any guys to the right of the brown horse? | no |
| GQA | Which direction is the animal that looks white and brown looking at? | forward |
| GQA | What type of animal is that fence behind of, an elephant or a giraffe? | giraffe |
| SugarCREPE | Is there a white pitcher holding flowers in a window sill? | yes |
| SugarCREPE | Are a cat and a dog napping together under a blanket on the couch? | no |
| SugarCREPE | Is a dog sitting in front of a laptop on top of a bed? | yes |
| WinoGround | Verify image matches text=“two humans and one wheel” | yes |
| WinoGround | Verify image matches text=“red building with white shutters” | no |
| WinoGround | Verify image matches text=“the person with the white collared shirt waters the plant while the other holds it” | yes |
Table 6: Dataset statistics, reported as train/evaluation.

| Dataset | # Samples | # Images | # Questions | # Answers | # Question Types | # Questions/Type |
|---|---|---|---|---|---|---|
| GQA | 1022/502 | 1014/487 | 937/474 | 176/122 | 105/102 | 10/5 |
| WinoGround | -/1600 | -/800 | -/800 | -/2 | -/70 | -/8 |
| SugarCREPE | 788/560 | 335/260 | 765/557 | 2/2 | 7/7 | 52/80 |
Appendix B Unit Test Sampling Pseudocode
For clarity, Algorithm 1 presents the pseudocode for the unit test coverage sampling method described in Section 3.
Appendix C Program Generation and Execution
In this section, we outline the implementation details for program generation and execution.
C.1 Generation Details
For program generation we use in-context examples both for off-the-shelf inference and for fine-tuned model inference. Generation is conducted using VLLM with the following generation parameters: temperature=1.0, top_p=0.9, top_k=0.0, max_new_tokens=320, and num_beams=1. We set the temperature to a high value to ensure diversity in generated programs. For CodeLlama we prefix the prompt with <s>, and for CodeGemma we enclose it in <bos><start_of_turn>[..]<end_of_turn>.
C.2 Image Patch API
We present the ImagePatch API in Code LABEL:code:api_prompt, which we adapt from Khan et al. [23], which is in turn adapted from ViperGPT [49]. We implement object detection using IDEA-Research/grounding-dino-base [32] with text_threshold=box_threshold=0.2, image-text matching using openai/clip-vit-large-patch14-336 [39] with a 0.8 similarity threshold for detection, and the underlying visual question answering module is Salesforce/blip2-flan-t5-xxl [28] loaded in 8-bit using BitsAndBytes with a maximum batch size of 4 and generation hyperparameters length_penalty=-1, num_beams=5, max_length=10, min_length=1, do_sample=False, top_p=0.9, repetition_penalty=1.0, and temperature=1 for QA; we set length_penalty=1 and max_length=30 for captioning. All models are served by HuggingFace.
C.3 In-Context Examples
We present the in-context examples used for visual question answering and image-text matching in Codes LABEL:code:vqa_ice and LABEL:code:itm_ice respectively. Code execution is handled using multiprocessing with a batch size of 30, and a timeout of 120 seconds, after which a TimeOutException is raised if execution exceeds the limit.
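A sketch of the timed execution wrapper is shown below. The convention that generated programs define an `execute_command(image)` entry point follows ViperGPT-style prompts, and the exception class, worker structure, and queue-based result passing are illustrative assumptions.

```python
# Sketch of program execution with a 120-second timeout (Appendix C.3).
# Generated programs are assumed to define `execute_command(image)`;
# the structure below is illustrative rather than the exact implementation.
import multiprocessing as mp

class TimeOutException(Exception):
    pass

def _worker(program_src, image, api_globals, queue):
    scope = dict(api_globals)              # expose the ImagePatch API to the program
    exec(program_src, scope)               # defines execute_command(image)
    queue.put(scope["execute_command"](image))

def run_with_timeout(program_src, image, api_globals=None, timeout=120):
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(program_src, image, api_globals or {}, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        raise TimeOutException(f"Execution exceeded {timeout} seconds")
    if queue.empty():                      # the program itself raised an exception
        raise RuntimeError("Program execution failed")
    return queue.get()
```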
Appendix D Unit Test Generation
D.1 Implementation Details
To generate the unit test image descriptions and expected answers, we prompt meta-llama/Meta-Llama-3-8B-Instruct, executed via VLLM with the following generation parameters: temperature=0.7, top_p=0.9, top_k=0.0, max_new_tokens=512, and num_beams=1. We return 3 output sequences, from which we extract the unit tests, deduplicate them, and filter out answers longer than five words (since they are out of distribution for the task) before feeding them to the sampling module.
D.2 In-Context Examples
We prompt the LLM with the system prompt presented below, as well as in-context examples presented in Codes LABEL:code:ut_gen_ice_vqa and LABEL:code:ut_gen_ice_itm for VQA and ITM respectively.
You are a skilled AI assistant specialized in generating test cases for programs that respond to queries about images.
D.3 Unit Test Candidate Generation
We experiment with two prompting methodologies for unit test generation: Query-Only and Query+Implementation. The former only takes into account the user query to generate the unit tests, while the latter also takes into account each generated program. We prompt the Visual Program Generator in the same way, but additionally include implementation examples and the current implementation, as shown in Code LABEL:code:vqa_ice_implementation.
D.4 Image Generation
To generate the images we use the diffusers library, prompting each of the models with generation hyperparameters guidance_scale=16.0 and num_inference_steps=50. In the case of NSFW image generation, we update the seed by 1 and regenerate the image, up to 10 times. Effectively, all unit tests have a corresponding image. We use the following implementations: CompVis/stable-diffusion-v1-4 for SDv1.4, longlian/lmd_plus for LM Guided Diffusion, and stabilityai/stable-diffusion-xl-base-1.0 for SDXL3.
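The following sketch shows how such generation might look with the diffusers library for the SDv1.4 checkpoint, using the hyperparameters listed above; the seed-retry loop for NSFW-filtered outputs mirrors the description, but its exact structure is an assumption.

```python
# Sketch of unit-test image generation with diffusers (SDv1.4 checkpoint),
# using guidance_scale=16.0 and num_inference_steps=50 as listed above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def generate_test_image(caption, seed=0, max_retries=10):
    out = None
    for attempt in range(max_retries):
        generator = torch.Generator("cuda").manual_seed(seed + attempt)
        out = pipe(caption, guidance_scale=16.0, num_inference_steps=50,
                   generator=generator)
        if not out.nsfw_content_detected[0]:   # retry with a new seed if filtered
            return out.images[0]
    return out.images[0]                       # fall back to the last attempt
```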
D.4.1 LM Grounded Diffusion
To generate the bounding boxes and phrases for LM Grounded Diffusion we prompt meta-llama/Meta-Llama-3-8B-Instruct, executed via VLLM with the following generation parameters: temperature=1.0, top_p=0.9, top_k=0.0, max_new_tokens=320, and num_beams=1. We return 5 candidate sequences to collect multiple candidates since we notice that often the extracted phrases can be empty, leading to failure in image generation. We present the prompt and in-context examples used for this part in Code LABEL:code:lm_grounded.
Appendix E Strategies for Visual Unit Test Generation
E.1 Unit Test Sampler
Figure 10 illustrates the impact of different sampling strategies with varying the number of unit tests and program configurations. Our results indicate that ‘Coverage by Answer then Input’, consistently outperforms other methods. To gain deeper insights, we categorize the questions into three groups: Spatial, Attribute, and Other. For GQA, we classify any question groups containing Attr as Attribute and those mentioning location or position as Spatial. Figure 11 presents the average performance across scenarios with at least five unit tests and three program configurations. Notably, the Coverage by Answer Then Input strategy emerges as the most effective for questions in the Attribute category.


E.2 Image Generator
Figure 12 shows the impact of various diffusion models across different numbers of unit tests and program configurations. Our analysis reveals that LM-Guided diffusion consistently outperforms other methods, particularly in scenarios with more programs, where the likelihood of finding a suitable program for execution is higher. To provide a deeper understanding, Figure 13 illustrates the average performance across scenarios involving at least three unit tests and two program configurations, focusing on the categories defined in the previous subsection. Notably, LM-Guided diffusion proves most effective for questions in the Spatial category, highlighting the advantages of more controllable generation in achieving higher spatial fidelity.


E.3 Scoring function
Figure 14 highlights the impact of error penalties across varying configurations of unit tests and programs. While their effect becomes negligible in higher-resource configurations with more programs and unit tests, error penalties prove beneficial in lower-resource settings. In these scenarios, they help prioritize the selection of executable programs, thereby improving performance. Notably, runtime error penalties are more impactful for GQA, whereas compilation error penalties play a larger role in WinoGround. This difference likely stems from the higher complexity of WinoGround programs, which are more prone to compilation errors.

E.4 Aggregate Scorer
Figure 15 illustrates the impact of various aggregator functions on accuracy. Among these, mean score aggregation consistently outperforms other methods, particularly in configurations with a higher number of programs. In the case of WinoGround, however, max aggregation also performs competitively, occasionally surpassing mean aggregation. This is likely due to the binary nature of the answers in WinoGround and the increased likelihood of selecting programs that are correct for incorrect reasons.

Appendix F Visual Unit Test Utilization Methods
F.1 Best Program Selection
Table 7 shows additional results on best program selection with varying numbers of programs.
VQA | Image-Text Matching | |||||
LLM | # Prog | # UT | GQA | Winoground | SugarCREPE | Avg. |
Base Setup | ||||||
gpt-4o-mini | 1 | 0 | 42.03±1.21 | 44.98±0.75 | 38.75±0.47 | 41.92±0.81 |
CodeLlama-7B | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
CodeGemma-7B | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
Most Common Answer Setup | ||||||
CodeLlama-7B | 2 | 0 | 27.76±0.41 | 36.19±0.66 | 32.02±2.25 | 31.99±1.11 |
CodeLlama-7B | 3 | 0 | 35.99±0.70 | 42.40±0.85 | 37.26±2.70 | 38.55±1.42 |
CodeLlama-7B | 4 | 0 | 38.71±1.61 | 42.12±0.60 | 39.17±2.01 | 40.00±1.41 |
CodeLlama-7B | 5 | 0 | 42.50±1.50 | 45.85±0.77 | 41.67±1.79 | 43.34±1.35 |
CodeGemma-7B | 2 | 0 | 31.87±0.80 | 33.04±0.67 | 36.37±1.62 | 33.76±1.03 |
CodeGemma-7B | 3 | 0 | 40.31±1.00 | 40.50±1.33 | 44.58±0.55 | 41.80±0.96 |
CodeGemma-7B | 4 | 0 | 40.44±0.53 | 43.06±1.89 | 44.46±1.17 | 42.66±1.20 |
CodeGemma-7B | 5 | 0 | 43.89±0.98 | 46.04±1.48 | 46.67±1.69 | 45.53±1.38 |
ViUniT Setup (Ours) | ||||||
CodeLlama-7B | 2 | 5 | 41.90±1.74 | 46.65±1.63 | 40.24±0.82 | 42.93±1.40 |
CodeLlama-7B | 3 | 5 | 45.68±0.94 | 48.54±0.37 | 43.93±1.09 | 46.05±0.80 |
CodeLlama-7B | 4 | 5 | 49.07±2.39 | 50.17±0.54 | 45.65±1.22 | 48.30±1.38 |
CodeLlama-7B | 5 | 5 | 49.27±1.13 | 49.73±0.73 | 47.02±1.19 | 48.67±1.02 |
CodeGemma-7B | 2 | 5 | 44.02±0.72 | 49.27±0.57 | 46.73±2.30 | 46.67±1.20 |
CodeGemma-7B | 3 | 5 | 46.08±0.41 | 51.17±1.98 | 48.93±1.86 | 48.73±1.42 |
CodeGemma-7B | 4 | 5 | 47.88±1.36 | 52.25±1.35 | 50.83±1.32 | 50.32±1.34 |
CodeGemma-7B | 5 | 5 | 48.01±1.05 | 51.92±0.90 | 51.85±2.16 | 50.59±1.37 |
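Combining the scoring and aggregation sketches above, best program selection reduces to an argmax over candidates. The helper below is a hypothetical illustration of that step, with the scorer passed in as a parameter.

```python
# Hypothetical selection step: score each candidate program on the same unit
# tests (e.g. with a scorer like score_program sketched in Appendix E.3) and
# keep the argmax. Tie-breaking and refusal thresholds are handled elsewhere.
def select_best_program(candidate_programs, unit_tests, scorer):
    scored = [(scorer(program, unit_tests), program) for program in candidate_programs]
    best_score, best_program = max(scored, key=lambda pair: pair[0])
    return best_program, best_score
```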
F.2 Answer Refusal
Figure 16 shows additional statistics on answer refusal, in particular the accuracy of selecting the programs that provide the final answer and of the programs that pass the unit tests at different thresholds.
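A hypothetical refusal wrapper illustrates the thresholding: if even the best candidate scores below a cutoff on the unit tests, the system declines to answer (or defers to the end-to-end fallback of Appendix G). The threshold value shown is illustrative; Figure 16 sweeps it.

```python
# Hypothetical answer-refusal wrapper: if the best program's unit-test score is
# below a threshold, refuse to answer (or hand off to a fallback model).
REFUSAL_THRESHOLD = 0.8  # illustrative value


def answer_or_refuse(best_program, best_score, image, run_program):
    if best_score < REFUSAL_THRESHOLD:
        return None  # refusal: caller may invoke an end-to-end fallback (Appendix G)
    return run_program(best_program, image)
```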

F.3 Re-prompting
F.3.1 Implementation Details
We consider an application of unit tests in which different candidate programs are generated whenever the initially generated program scores below a threshold. To do so, we keep the same hyperparameters for the program generator but adapt the prompt to include the unit-test outputs and use suitable in-context examples, as shown in Codes LABEL:code:viunit_reprompting_vqa and LABEL:code:viunit_reprompting_itm for VQA and ITM, respectively.
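A minimal sketch of this loop is shown below, assuming hypothetical run_tests and generate_program helpers; the prompt format is illustrative and not the exact template referenced above.

```python
# Hypothetical re-prompting loop: when the program scores below a threshold on
# the unit tests, its failures are serialized into the prompt and the code LLM
# is asked to rewrite the program. Helper names and prompt format are assumed.
def viunit_reprompt(query, program, unit_tests, generate_program, run_tests,
                    threshold=0.8, max_iters=2):
    for _ in range(max_iters):
        results = run_tests(program, unit_tests)  # one dict per test: passed/description/expected/output
        score = sum(r["passed"] for r in results) / len(results)
        if score >= threshold:
            break
        feedback = "\n".join(
            f"- image: {r['description']} | expected: {r['expected']} | got: {r['output']}"
            for r in results if not r["passed"])
        prompt = (f"Query: {query}\n"
                  f"Previous program:\n{program}\n"
                  f"Failed unit tests:\n{feedback}\n"
                  f"Rewrite the program so that it passes the unit tests.")
        program = generate_program(prompt)
    return program
```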
Error Reprompting Baseline
We employ the same model and hyperparameters as ViUniT reprompting, but instead adapt the prompt to take into account the error messages rather than the unit tests, as shown in Codes LABEL:code:error_reprompting_vqa and LABEL:code:error_reprompting_itm for VQA and ITM, respectively.
F.3.2 Additional Results
Table 8 presents the results of an additional reprompting iteration, highlighting that while ViUniT reprompting continues to achieve higher overall performance, there is a slight drop in accuracy compared to the previous iteration. This decline can be attributed to its attempts to refine programs that may already produce correct answers for the wrong reasons. Such corrections can inadvertently shift the generated answers, leading to decreased accuracy despite the method’s focus on improving program fidelity.
LLM | Iter. | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg.
---|---|---|---|---|---|---|---
Base Setup (Iteration = 0) | |||||||
CodeLlama-7B | 0 | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
CodeGemma-7B | 0 | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
Error Reprompting | |||||||
CodeLlama-7B | 1 | 1 | 0 | 37.92±2.68 | 42.46±0.57 | 33.21±0.64 | 37.86±1.30 |
CodeLlama-7B | 2 | 1 | 0 | 38.78±2.22 | 44.58±0.44 | 37.08±1.08 | 40.15±1.25 |
CodeGemma-7B | 1 | 1 | 0 | 42.63±2.42 | 42.42±1.91 | 44.52±1.05 | 42.63±2.42 |
CodeGemma-7B | 2 | 1 | 0 | 42.90±2.65 | 43.08±1.73 | 45.30±0.92 | 42.90±2.65 |
ViUniT Reprompting (Ours) | |||||||
CodeLlama-7B | 1 | 1 | 5 | 46.68±2.52 | 51.85±0.40 | 47.68±2.17 | 48.74±1.69 |
CodeLlama-7B | 2 | 1 | 5 | 46.95±1.33 | 52.04±0.83 | 48.04±1.64 | 49.01±1.26 |
CodeGemma-7B | 1 | 1 | 5 | 45.75±0.30 | 48.19±2.28 | 48.21±1.12 | 47.38±1.23 |
CodeGemma-7B | 2 | 1 | 5 | 44.42±1.00 | 49.25±2.66 | 48.81±1.19 | 47.49±1.62 |
F.4 Reward Design for Reinforcement Learning
F.4.1 Implementation Details
Table 9 contains additional hyperparameters used for training; a configuration sketch expressing these values in code follows the table. Each RL epoch takes about 30 minutes with the correctness reward and about 90 minutes with the ViUniT reward, since the latter requires executing unit tests.
Parameter | Value
---|---
warmup_ratio | 0.1 | ||||
max_grad_norm | 0.3 | ||||
lr_scheduler_type | linear | ||||
learning_rate | 2e-4 | ||||
lora_config.r | 16 | ||||
lora_config.lora_alpha | 32 | ||||
lora_config.lora_dropout | 0.05 | ||||
lora_config.bias | none | ||||
lora_config.target_modules |
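Expressed in code, the Table 9 values map onto a PEFT LoRA configuration and standard trainer arguments as in the sketch below. The target_modules entry did not survive extraction, so the module names are an assumption, as is the output path.

```python
# A sketch of the Table 9 hyperparameters as a PEFT LoRA config plus trainer
# arguments. target_modules and output_dir are assumptions, not the paper's
# setting; all other values are taken directly from the table.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # assumed; not recoverable from the table
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="viunit_rl",          # placeholder path
    learning_rate=2e-4,
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    lr_scheduler_type="linear",
)
```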
F.4.2 Additional Analysis
Table 10 highlights the reduced error rates (measured as the number of programs leading to exceptions) achieved using the ViUniT reward. Additionally, Table 11 presents the results of cross-task and cross-dataset generalization for policies trained on GQA, following the approach of [23]. For VQAv2 [11], we sample 10 questions for each of the 50 most common answers from the validation split of the compositional subset curated by [43], similar to [23]. For OKVQA [35], we sample 10 questions per question type, for a total of 110 questions. The results indicate that while both reward types generalize well across tasks and datasets, the ViUniT reward consistently delivers superior performance.
LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg.
---|---|---|---|---|---|---
Supervised Correctness Reward | ||||||
CodeLlama-7B | 1 | 0 | 15.14±7.74 | 8.21±1.72 | 20.06±3.62 | 14.47±4.36 |
CodeGemma-7B | 1 | 0 | 9.10±9.35 | 13.25±6.30 | 12.86±4.41 | 11.73±6.69 |
Unsupervised ViUniT Reward (Ours) | ||||||
CodeLlama-7B | 1 | 0 | 9.56±2.13 | 10.31±1.55 | 15.42±3.03 | 11.76±2.24 |
CodeGemma-7B | 1 | 0 | 1.99±0.91 | 5.81±0.49 | 6.25±1.02 | 4.68±0.80 |
LLM | # Prog | # UT | VQAv2 (X-Dataset) | OK-VQA (X-Dataset) | Winoground (X-Task) | SugarCREPE (X-Task)
---|---|---|---|---|---|---
Base Setup | ||||||
CodeLlama-7B | 1 | 0 | 25.67±2.20 | 16.09±2.02 | 30.54±0.99 | 35.12±1.46 |
CodeGemma-7B | 1 | 0 | 36.40±1.44 | 27.58±2.48 | 42.56±1.52 | 41.33±1.72 |
Supervised Correctness Reward | ||||||
CodeLlama-7B | 1 | 0 | 34.33±7.82 | 24.12±5.98 | 41.02±3.05 | 37.14±6.48 |
CodeGemma-7B | 1 | 0 | 42.47±6.03 | 28.12±6.20 | 47.98±4.98 | 39.94±11.58 |
Unsupervised ViUniT Reward (Ours) | ||||||
CodeLlama-7B | 1 | 0 | 35.87±2.31 | 25.64±0.91 | 43.63±2.89 | 44.35±3.18 |
CodeGemma-7B | 1 | 0 | 44.00±4.20 | 36.85±3.48 | 51.78±0.41 | 49.23±2.54 |
Appendix G End-to-End Fallback Methods
G.1 Implementation Details
G.1.1 VQA
For VQA, we fall back to asking the query directly to Salesforce/blip2-flan-t5-xxl [28], loaded in 8-bit with BitsAndBytes, with a maximum batch size of 4 and generation hyperparameters length_penalty=-1, num_beams=5, max_length=10, min_length=1, do_sample=False, top_p=0.9, repetition_penalty=1.0, and temperature=1.
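A minimal sketch of this fallback, assuming the standard transformers BLIP-2 interface; the "Question: ... Answer:" prompt wrapping is an assumption.

```python
# Sketch of the VQA fallback: BLIP-2 (flan-t5-xxl) in 8-bit with the generation
# hyperparameters listed above. Exact prompt wrapping is assumed.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)


def vqa_fallback(image: Image.Image, question: str) -> str:
    inputs = processor(images=image, text=f"Question: {question} Answer:",
                       return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            length_penalty=-1,
            num_beams=5,
            max_length=10,
            min_length=1,
            do_sample=False,
            top_p=0.9,
            repetition_penalty=1.0,
            temperature=1.0,
        )
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```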
G.1.2 Image-Text-Matching
For image-text matching, we fall back to openai/clip-vit-large-patch14-336 [39], using a similarity threshold of 0.8 for a positive match and predicting a negative match otherwise.
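A sketch of this fallback using the transformers CLIP interface. Cosine similarity of the projected embeddings is assumed as the similarity score here; the exact normalization applied before thresholding may differ.

```python
# Sketch of the image-text-matching fallback: CLIP ViT-L/14-336 with a 0.8
# similarity threshold. Cosine similarity of normalized embeddings is assumed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")


def itm_fallback(image: Image.Image, text: str, threshold: float = 0.8) -> bool:
    inputs = clip_processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()
    return similarity >= threshold  # True = positive match, False = negative
```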
G.2 Results with Fallback Method on Exception
In this work, we report results without employing a fallback method on exceptions, treating such cases as failures in order to better assess the quality of the programs generated by different methods. However, it is common in the literature to report accuracy with a fallback applied on exceptions. Table 12 presents the best program selection results using this fallback approach on error.
LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg.
---|---|---|---|---|---|---
Base Setup | ||||||
gpt-4o-mini† | 1 | 0 | 43.76±1.72 | 51.94±0.56 | 49.46±1.25 | 48.39±1.17 |
CodeLlama-7B† | 1 | 0 | 44.75±2.01 | 51.65±1.09 | 48.57±0.82 | 48.32±1.31 |
CodeGemma-7B† | 1 | 0 | 44.82±2.30 | 47.23±2.26 | 50.18±0.71 | 47.41±1.76 |
Most Common Answer Setup | ||||||
CodeLlama-7B† | 5 | 0 | 49.07±2.79 | 51.29±0.87 | 46.79±1.29 | 49.05±1.65 |
CodeGemma-7B† | 5 | 0 | 46.61±1.24 | 49.10±1.32 | 49.17±1.52 | 48.29±1.36 |
ViUniT Setup (Ours) | ||||||
CodeLlama-7B† | 5 | 5 | 49.27±1.33 | 49.73±0.73 | 47.02±1.19 | 48.67±1.08 |
CodeGemma-7B† | 5 | 5 | 48.14±1.02 | 51.92±0.90 | 51.85±2.16 | 50.63±1.36 |
Appendix H Human Evaluation
This section presents details of the human evaluations of unit-test quality and program correctness. We used Google Forms to conduct the evaluations.
H.1 Unit Test Evaluation
To assess the quality of unit tests, we randomly sample 20 examples from each of the three datasets, each corresponding to 5 unit tests, resulting in a total of 300 unit tests for evaluation. The unit tests were judged by three independent annotators, who were asked “Is the answer [answer] correct given the image?”, where [answer] was populated with the unit test’s expected answer, with binary yes/no responses. Table 13 breaks down the results, showing that on average 75% of unit tests are correct. The annotators then optionally annotated the reason for failure by selecting from “Missing Object”, “Spatial Error”, “Incomplete Object”, “Color Mismatch”, or “Other”. Figure 17 shows the breakdown by error type, highlighting “Missing Object” as the most common source of error.
 | GQA | WinoGround | SugarCREPE | Avg.
---|---|---|---|---
Acc. (%) | 68.00 | 75.00 | 82.00 | 75.00
 | 0.39 | 0.70 | 0.67 | 0.58

H.2 Program Correctness Evaluation
To assess the improvements in program quality from applying ViUniT, we conduct a human evaluation rating GQA programs generated by the Base Setup and the programs selected by ViUniT from 5 candidate programs with 5 unit tests. Two annotators with 3+ years of Python experience graded programs using the following scheme: “Correct: The code accurately and fully answers the query.” (0), “Partially Correct: The code answers the query but has some issues.” (1), “Incorrect: The code does not answer the query correctly.” (2), and “Irrelevant: The code is unrelated to the query.” (3). In addition, they were optionally asked to select the source of error from “Missing Condition”, “Incorrect Logic”, “Irrelevant to the query”, “Wrong Conditions”, “Missing Checks (e.g. could get list index out of range)”, “Performance Issues”, or “Other”. Table 14 shows the breakdown of program correctness improvements using ViUniT, and Figure 18 shows the error types identified for each method. ViUniT has “Missing Checks” as its most common error type, which mostly involves not checking array length before accessing indices and typically still yields correct solutions with reasonable programs, whereas the main culprit for program incorrectness in the Base Setup is “Incorrect Logic”.
 | Base Setup | ViUniT Setup (Ours)
---|---|---
Fully Correct (= 0) | 77% | 86%
Partially Correct (≤ 1) | 86% | 95%
Incorrect (≥ 2) | 14% | 5%
Irrelevant (= 3) | 4% | 0%
 | 0.24 | 0.30
 | 0.59 | 0.40

Appendix I Limitations and Social Ethics Impact
I.1 Limitations
While ViUniT provides significant advancements in the logical correctness and robustness of visual programs, our framework has several limitations that present opportunities for future enhancement. First, although ViUniT improves program selection and execution by leveraging unit tests, it does not fully eliminate the issue of programs being correct for the wrong reasons, as shown by the human evaluation in Table 14. Our approach does not provide a formal guarantee of logical correctness, as it relies on automatically generated tests to evaluate candidate programs. Addressing this challenge opens avenues for integrating formal verification methods and more sophisticated testing strategies to further enhance program correctness. Second, while we optimize for maximizing input and output coverage during unit test generation, the generated tests may not fully capture the space of edge cases or subtle logical errors in complex programs. This limitation highlights the potential for future work to develop more comprehensive coverage metrics and testing methodologies, possibly incorporating code-line execution coverage or other verifiable metrics. Third, the improved accuracy and robustness achieved by ViUniT, as seen in Table 1, come with an increase in computational effort. Generating candidate programs, sampling unit tests, and executing them on generated images introduce additional overhead. This trade-off between accuracy and efficiency presents an exciting challenge for future research to optimize the framework for real-time or resource-constrained applications, possibly through algorithmic improvements or efficient execution strategies. Additionally, enhancing the explainability of program failures remains an area for further development. Providing clear and interpretable feedback when a program is rejected or not selected due to poor performance on unit tests can improve user trust and facilitate debugging; future work could focus on combining unit test outputs to offer detailed explanations of program failures. Finally, while ViUniT has demonstrated effectiveness on VQA and ITM tasks, exploring its applicability to other domains or tasks involving different modalities or reasoning paradigms presents an opportunity to extend its impact. Adapting the framework to diverse domains can unlock new possibilities and broaden its utility. Despite these limitations, the advancements introduced by ViUniT lay a strong foundation for future innovations in visual programming. By addressing these challenges, we can further enhance the robustness, efficiency, and applicability of the framework.
I.2 Social Ethics Impact
ViUniT enhances the robustness and correctness of visual programming, with applications in critical domains like autonomous driving, healthcare, and education. By reducing instances where programs are correct for the wrong reasons, it helps build more trustworthy AI systems. However, ethical considerations are crucial for its responsible deployment. First, ViUniT relies on pre-trained models, which may propagate biases (e.g., gender, racial, or cultural). Future work should focus on integrating bias detection and correction into unit test generation to promote fairness. Second, computational demands may limit access for resource-constrained organizations. Advancing efficiency and optimization can broaden accessibility and foster inclusivity. Third, increased computational needs may raise energy consumption. Optimizing for energy efficiency and using renewable energy can reduce the environmental impact, while improved AI reliability could deliver long-term sustainability benefits. Finally, in sensitive domains such as healthcare or legal decision-making, while ViUniT has the potential to enhance the correctness of visual programs, it is crucial to carefully communicate the framework’s limitations and ensure rigorous validation and transparency. By proactively addressing ethical challenges and focusing on responsible development, we can maximize the positive societal impact of ViUniT, paving the way for more reliable, fair, and trustworthy AI systems.
Appendix J Qualitative Examples

