ViUniT: Visual Unit Tests for More Robust Visual Programming
Abstract
Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.
1 Introduction


Visual Programming [14, 49], which involves generating executable programs that leverage state-of-the-art specialist systems (e.g. object detection, captioning, etc.), has emerged as an effective method for tackling compositional reasoning tasks [48, 15]. Correct visual programs must often be inferred without training programs, because program annotations are expensive to obtain. Recently, some methods improve the performance of visual program synthesis by leveraging programs that yield correct results on training data [23, 29]. While these approaches have shown improvements, a critical limitation persists: visual programs can be right for the wrong reasons. For example, a human evaluation of 100 visual programs that produced correct responses, generated by CodeLlama-7B [41] (a leading open-source large language model, LLM) for questions in GQA [20], showed that 33% of them were not actually correct, and 70% of the incorrect programs (23% of the total) would require significant rewriting to be correct.
To mitigate this prevailing issue, we propose Visual Unit Testing (ViUniT), a framework for automatically generating unit tests for visual programs.
While automatic unit test generation has gained momentum in text-based tasks [5, 46, 2, 12, 50], its application to visual program synthesis has been limited.
Recent efforts toward visual unit tests focused primarily on checking program return value types (e.g. the output falling outside a range of options, like yes or no) [25].
However, this approach does not assess the program’s execution or logical correctness, limiting the types of errors it can address.
In this work, we bridge this gap by addressing challenges that have hindered unit test use in visual question answering (VQA) and image-text-matching (ITM).
As seen in Figure 1, visual programming converts queries to code that executes on test images to provide a response. For such programs, unit tests take the form of images and expected answers. Unit tests are difficult to construct because they need to have sufficient coverage to diagnose errors. To solve this problem, we leverage language models to generate candidate sets of descriptions of images that could test the code (Section 3.2.1). We formulate an optimization criterion to select ones that maximize coverage of possible program inputs and outputs (Section 3.2.2), and convert selected descriptions to images (Section 3.2.3). Our approach is entirely unsupervised with no accompanying annotations.
Unit tests can be used to identify incorrect programs, but integrating this signal to improve model behavior is challenging. In Section 3.4 we explore several mechanisms, summarized in Figure 2, including:
- Best Program Selection: We select the candidate program that scores highest on the unit test suite, improving accuracy by an average of 11.4 points over the base setup (Table 1).
- Re-prompting: We use unit test outputs to guide the generation of improved programs when initial programs perform poorly on the unit test suite. Relative to regeneration without unit tests, programs are over 3% more accurate (Table 3).
- Unsupervised Reinforcement Learning (RL) Reward Design: We use unit test scores as feedback to fine-tune an LLM on programs more likely correct for the right reasons, surpassing supervised correctness-based rewards by an average of 1.3 points across tasks (Table 4).
- Answer Refusal: Unit test scores are used to assess program confidence, reverting to an end-to-end model if the program is not robust, achieving up to 0.8 F1 score in correctly refusing programs that would fail (Figure 9).
To summarize our contributions, we present ViUniT, the first framework to introduce unit tests that verify the logical correctness of visual programs.
We conduct a broad exploration of unit test generation configurations (Section 5), showing that maximizing coverage is an important criterion.
We introduce four ways to leverage unit-tests to improve models (Section 3.4): best program selection, answer refusal, re-prompting, and unsupervised reward design for reinforcement learning.
Overall, integrating unit tests improves frozen-LLM accuracy by 11.4% and enables 7B open-source LLMs to outperform proprietary models like gpt-4o-mini by an average of 7.7 points, while improving underlying code correctness.
Broader adoption of unit-test suites will significantly enhance the robustness and trustworthiness of visual programming approaches.
2 Related Work
Visual Program Synthesis: The recent advancements in LLMs [1, 33, 53, 54, 4, 3, 21, 36] have led to their use as a planning interface for the modularization of tools to execute complex reasoning tasks involving multiple modalities [34, 42, 47, 9, 29] and as a reasoning module for visual agents [58, 59, 57, 16]. Specialized coding LLMs [41, 27, 51, 13] have demonstrated significant potential in addressing visual challenges by generating executable code based on contextual demonstrations [49, 14, 15, 10, 55], with performance comparable to or better than vision-language models [28, 37, 31, 6]. Attempts to improve the initial paradigm involve automatically generating a pool of effective programs to retrieve as in-context examples [48] and tuning a model through reinforcement learning by sampling programs that succeed on the training set [23]. More relevant to this work, Hu et al. [19] distill program reasoning into a VLM as chain-of-thought reasoning by generating multiple programs per query and selecting the best one, either by using the ground truth answer as a proxy for correctness or by having it evaluated by an LLM. However, a critical issue remains: some generated programs achieve correct outcomes without sound reasoning, which we address in this paper.
LLM Unit Test Generation: Unit tests have been used as a reinforcement learning signal to train code-generating LLMs [27, 5, 44, 12, 45, 7]. Existing methods for automatic unit test generation with LLMs [5, 2, 12, 50] focus primarily on text-based tasks, generating entire unit test scripts. However, these approaches often result in issues like compilation errors, low coverage, redundant assertions, and empty tests [46]. Recent work [25] proposes property testing on the outputs of visual programs by leveraging LLMs to generate properties that should be satisfied by the output given the query (e.g. the output should be a color if the query asks for one). Yet, this method inherits many limitations of LLM-generated script-based unit testing, and crucially, it fails to assess logical correctness—meaning it overlooks cases where program outputs may be right for the wrong reasons. Instead, we propose a method of generating unit tests to verify the execution of visual programs without requiring an LLM to directly generate unit-test scripts, avoiding the issues that tend to accompany the automatic generation of unit tests using LLMs. In particular, we use LLMs to generate image descriptions and expected answers without requiring any direct code generation. Image descriptions and expected answers are then transformed into a unit test using a text-to-image diffusion model [40].

3 Method
In this section, we formalize the tasks of visual program synthesis and unit test generation (Section 3.1) and introduce our framework (Section 3.2).
Our method comprises two main components: unsupervised generation of visual unit tests (Section 3.2) and unit test scoring (Section 3.3). We propose four ways to leverage unit tests in Section 3.4: Best Program Selection, Answer Refusal, Re-Prompting, and Unsupervised RL Reward Design.
3.1 Task Definition
Visual Program Synthesis: Given a visual input $x$ and a textual query $q$ about $x$, our goal is to synthesize a program $z$ that correctly answers $q$ about $x$. Each program $z$ is executed on the visual input $x$ using an execution engine $\phi$, yielding a predicted answer $\hat{y} = \phi(z, x)$. Our objective is to select the program $z^*$ that is most likely to produce the correct answer $y$ to the query $q$ about $x$, formalized as:
$$z^* = \arg\max_{z} \Pr\big[\phi(z, x) = y\big] \quad (1)$$
Visual Unit Testing: To assess the candidate programs $\mathcal{Z} = \{z_1, \dots, z_K\}$, we employ a unit test generator $\mathcal{G}$, which generates a set of unit tests $T = \{(v_j, a_j)\}_{j=1}^{M}$. Each unit test consists of a test visual input $v_j$ and the corresponding correct answer $a_j$ to the query $q$ on that input. For each candidate program $z_i$, we execute it on all test inputs to obtain outputs $\hat{a}_{ij} = \phi(z_i, v_j)$.
3.2 Unsupervised Visual Unit Test Generation


Given a program $z$ that aims to solve a query $q$, our goal is to generate a set of unit tests $T$ comprising input images and expected answers, as shown in Figure 3. This process involves three main steps: Candidate Unit Test Generation (Section 3.2.1), Unit Test Sampling (Section 3.2.2), and Image Generation (Section 3.2.3).
3.2.1 Candidate Unit Test Generation
As illustrated in Figure 1, rather than generating images directly for unit tests, we first create image descriptions with expected answers. This approach reduces computational overhead during the preliminary stage of unit test coverage sampling, after which we generate images only for those tests that are included in the final unit test suite $T$. In particular, we first generate a superset of candidate unit tests $\tilde{T} = \{(c_j, a_j)\}_{j=1}^{N}$ using the unit test generator $\mathcal{G}$, which is implemented as an auto-regressive large language model. The unit test generator can take both the query and the program implementation as inputs, i.e., $\tilde{T} = \mathcal{G}(q, z)$. Each candidate unit test consists of an image caption $c_j$ and an expected answer $a_j$. We explore whether including the program implementation provides useful signals for unit test generation (Section 5), despite conventional engineering practices that advocate for implementation-independent unit tests. This allows us to investigate whether this principle extends to visual unit testing.
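A minimal Python sketch of this step is shown below. The `llm_generate` callable, the "description => answer" output format, and its parsing are illustrative assumptions rather than the paper's exact prompt; only the system prompt string and the filtering of answers longer than five words follow the settings reported in the supplement (Appendix D).

```python
# Sketch of candidate unit test generation (Section 3.2.1).
# `llm_generate(prompt, num_return_sequences)` stands in for any LLM call
# (e.g. a model served via VLLM); the "description => answer" line format
# and its parsing are illustrative assumptions.
def generate_candidate_tests(query, program=None, llm_generate=None, n_sequences=3):
    """Return deduplicated (image_description, expected_answer) candidate pairs."""
    prompt = (
        "You are a skilled AI assistant specialized in generating test cases "
        "for programs that respond to queries about images.\n"
        f"Query: {query}\n"
    )
    if program is not None:  # Query+Implementation variant (Section 3.2.1)
        prompt += f"Program under test:\n{program}\n"
    prompt += "List image descriptions and expected answers, one per line as 'description => answer'."

    candidates = []
    for output in llm_generate(prompt, num_return_sequences=n_sequences):
        for line in output.splitlines():
            if "=>" not in line:
                continue
            caption, answer = (part.strip() for part in line.split("=>", 1))
            # Answers longer than five words are filtered as out of distribution.
            if caption and answer and len(answer.split()) <= 5:
                candidates.append((caption, answer))
    return list(dict.fromkeys(candidates))  # deduplicate while preserving order
```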
3.2.2 Unit Test Coverage Sampling
Unit tests verify the behavior of code and should exhibit high isolation and coverage [24]. In the context of visual programs, isolation is trivial since each program is a self-contained function. However, achieving high coverage—ensuring that the tests collectively exercise as much of the codebase as possible—is non-trivial due to the computational overhead of executing all candidate tests. To address this, we define coverage metrics tailored for visual programming unit tests, focusing on maximizing the diversity of both expected answers and visual inputs. The coverage sampler subsamples $M$ pairs from $\tilde{T}$, forming the subset $T_s$.
Coverage by Answer: We aim to include tests that cover all possible expected answers present in the candidate set. Let $A$ be the set of all expected answers in $\tilde{T}$. We define the answer diversity criterion as ensuring that for every possible answer $a \in A$, there is at least one selected test $(c_j, a_j)$ such that $a_j = a$:
$$\forall a \in A \;\; \exists (c_j, a_j) \in T_s : a_j = a \quad (2)$$
Coverage by Input: To maximize the diversity of visual inputs without generating all possible images, we operate on the image captions. We define an encoding function $E(\cdot)$ that maps a caption to a feature vector. We aim to maximize the input diversity score $D(T_s)$, defined as the maximum pairwise distance between the encoded captions:
$$D(T_s) = \max_{(c_i, \cdot), (c_j, \cdot) \in T_s,\, i \neq j} \big\lVert E(c_i) - E(c_j) \big\rVert \quad (3)$$
This encourages the selection of tests with diverse descriptions, which in turn is likely to yield diverse images.
Coverage by Answer then Input: We begin by selecting one test for each possible answer to satisfy the answer diversity criterion (Equation 2). Then, we iteratively select additional tests that maximize $D$ using the following criterion until $M$ tests are selected, forming the subset $T_s$:
$$t^* = \arg\max_{t \in \tilde{T} \setminus T_s} D\big(T_s \cup \{t\}\big) \quad (4)$$
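For concreteness, the following sketch approximates the "Coverage by Answer then Input" sampler using sentence-transformers embeddings (the paper's caption encoder is all-MiniLM-L6-v2); the greedy max-min selection step is a standard farthest-point approximation of Equation 4 and of Algorithm 1 in the supplement, not their exact implementation.

```python
# Approximate "Coverage by Answer then Input" sampling (Eqs. 2-4).
# The greedy step adds, at each iteration, the candidate caption that is
# farthest in embedding space from the captions already selected.
import numpy as np
from sentence_transformers import SentenceTransformer

def sample_unit_tests(candidates, M=5, model_name="all-MiniLM-L6-v2"):
    """candidates: list of (caption, expected_answer) pairs; returns M of them."""
    encoder = SentenceTransformer(model_name)
    embs = encoder.encode([caption for caption, _ in candidates])  # (N, d)

    selected = []
    # Coverage by answer (Eq. 2): one test per distinct expected answer.
    for answer in dict.fromkeys(ans for _, ans in candidates):
        selected.append(next(i for i, (_, a) in enumerate(candidates) if a == answer))

    # Coverage by input (Eqs. 3-4): greedily add the most distant remaining caption.
    while len(selected) < min(M, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in selected]
        dists = [min(np.linalg.norm(embs[i] - embs[j]) for j in selected)
                 for i in remaining]
        selected.append(remaining[int(np.argmax(dists))])

    return [candidates[i] for i in selected[:M]]
```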
3.2.3 Image Generation
For each selected unit test $(c_j, a_j) \in T_s$, we generate the corresponding image $v_j$ using a text-to-image model $\mathcal{M}$, yielding the final unit-test suite $T = \{(v_j, a_j)\}_{j=1}^{M}$ with $v_j = \mathcal{M}(c_j)$. We employ three state-of-the-art diffusion models: SDv1.4 [40], SDXL3 [38], and LM Guided Diffusion [30], which utilizes automatically generated templates with phrases and bounding boxes for spatial conditioning [30]. To provide these additional signals, we prompt an LLM with in-context examples and the caption to generate pairs of phrases and bounding boxes to feed into the text-to-image model.
3.3 Program Selection Based on Unit Test Scores
We select the program that succeeds on the most unit tests via Equation 6, where the overall score $S(z_i)$ is computed by an aggregator $\sigma$ over individual unit test scores $s(z_i, t_j)$.
Individual Unit Test Scorer $s$: For each program $z_i$ and test $t_j = (v_j, a_j)$, we execute $z_i$ on $v_j$ to obtain the predicted answer $\hat{a}_{ij} = \phi(z_i, v_j)$. We define a scoring function $s$ that assigns a score based on the program's output:
$$s(z_i, t_j) = \begin{cases} \gamma_c & \text{if } z_i \text{ fails to compile} \\ \gamma_r & \text{if } z_i \text{ raises a runtime error on } v_j \\ \mathbb{1}\big[\hat{a}_{ij} = a_j\big] & \text{otherwise} \end{cases} \quad (5)$$
where $\gamma_r$ and $\gamma_c$ are runtime and compilation error penalties and $\mathbb{1}[\cdot]$ is the indicator function.
Score Aggregator $\sigma$: The individual scores are aggregated to compute an overall score $S(z_i) = \sigma\big(\{s(z_i, t_j)\}_{j=1}^{M}\big)$. Here, $\sigma$ represents the averaging function. The program with the highest score is selected as the best candidate, approximating Equation 1 by:
$$z^* = \arg\max_{z_i \in \mathcal{Z}} S(z_i) \quad (6)$$
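The scoring and selection logic can be summarized in a short sketch. The execution engine `run_program`, the penalty values, and the answer normalization are assumptions rather than the paper's exact settings.

```python
# Sketch of unit test scoring (Eq. 5) and best-program selection (Eq. 6).
# `run_program(program, image)` plays the role of the execution engine phi;
# the penalty values below are illustrative placeholders.
GAMMA_COMPILATION = -0.5  # gamma_c, compilation-error penalty (assumed value)
GAMMA_RUNTIME = -0.25     # gamma_r, runtime-error penalty (assumed value)

def score_program(program, unit_tests, run_program):
    """unit_tests: list of (test_image, expected_answer); returns the mean score S(z)."""
    scores = []
    for image, expected in unit_tests:
        try:
            predicted = run_program(program, image)
        except SyntaxError:
            scores.append(GAMMA_COMPILATION)
        except Exception:
            scores.append(GAMMA_RUNTIME)
        else:
            scores.append(float(str(predicted).strip().lower() == expected.strip().lower()))
    return sum(scores) / len(scores)  # mean aggregator sigma

def select_best_program(programs, unit_tests, run_program):
    """Return (S(z*), z*), the highest-scoring candidate program (Eq. 6)."""
    return max(((score_program(z, unit_tests, run_program), z) for z in programs),
               key=lambda pair: pair[0])
```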
3.4 Visual Unit Test Utilization Methods
Figure 2 illustrates how to leverage visual unit tests in four ways, further elaborated below:
Best Program Selection:
Given a set of candidate programs $\mathcal{Z}$ for a query $q$, our goal is to select the program that is most likely to produce the correct answer when executed on the visual input $x$. We utilize the unit test scores $S(z_i)$ computed for each program as described in Section 3.3. The best program, i.e., the program that succeeds on the most unit tests, is selected by solving the optimization problem in Equation 6.
Answer Refusal: If the maximum unit test score falls below a threshold $\tau$, indicating low confidence in all candidate programs, we refuse to provide a programmatic answer and instead fall back to an end-to-end model (refer to the supplement for details). Formally, the decision rule is: refuse if $\max_{z_i \in \mathcal{Z}} S(z_i) < \tau$.
Otherwise, we proceed to execute the selected program $z^*$ on the original visual input $x$ to obtain the final answer $\hat{y} = \phi(z^*, x)$.
The hyperparameter $\tau$ balances a trade-off between attempting to answer with potentially incorrect programs and deferring to a more reliable but less interpretable method.
Re-Prompting: If all generated programs fail to meet the threshold (i.e., $\max_{z_i \in \mathcal{Z}} S(z_i) < \tau$), we employ a re-prompting strategy to generate better candidate programs using feedback from unit tests:
$$\mathcal{Z}' = \pi\big(\rho, F\big) \quad (7)$$
where $\rho$ is an adaptation of the original input containing the API, the query $q$, and in-context examples of unit-test-feedback corrections; $F$ is the feedback derived from unit test results (comprising unit test image descriptions, expected answers, and the predicted answers generated by the program in the current iteration), summarizing the discrepancies between expected and actual outputs; and $\pi$ is the program generator.
We select the best program $z^*$ from the new set $\mathcal{Z}'$ based on their unit test scores $S(z)$.
If $S(z^*) \geq \tau$, we execute $z^*$ on the original visual input $x$. Otherwise, we may repeat the re-prompting process until a predefined number of iterations is reached.
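One possible way to combine answer refusal and re-prompting is sketched below. The threshold value, the fallback model, and the `regenerate_programs` and `score_program` helpers (the latter as in the scoring sketch above) are assumptions rather than the paper's exact implementation.

```python
# Sketch combining answer refusal and unit-test re-prompting (Section 3.4).
# `regenerate_programs(query, feedback)` is an LLM call conditioned on
# unit-test feedback (Eq. 7) and `fallback_vqa` is an end-to-end model;
# all names and the default threshold are illustrative assumptions.
def answer_query(image, query, programs, unit_tests, run_program, score_program,
                 regenerate_programs, fallback_vqa, tau=0.8, max_iterations=1):
    for iteration in range(max_iterations + 1):
        scored = [(score_program(z, unit_tests, run_program), z) for z in programs]
        best_score, best_program = max(scored, key=lambda pair: pair[0])
        if best_score >= tau:                      # confident: run on the real input
            return run_program(best_program, image)
        if iteration < max_iterations:             # re-prompt with unit-test feedback
            feedback = build_feedback(best_program, unit_tests, run_program)
            programs = regenerate_programs(query, feedback)
    return fallback_vqa(image, query)              # refuse the programmatic answer

def build_feedback(program, unit_tests, run_program):
    """Summarize expected vs. predicted answers on each unit test (the feedback F)."""
    lines = []
    for image, expected in unit_tests:
        try:
            predicted = run_program(program, image)
        except Exception as err:
            predicted = f"<error: {err}>"
        lines.append(f"expected: {expected} | predicted: {predicted}")
    return "\n".join(lines)
```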
Unsupervised Reinforcement Learning Reward Design: We propose to design RL rewards based on visual unit tests, aiming not only to provide extra supervision but also to curtail policy deterioration due to logically incorrect programs [23]. The goal is to optimize a policy $\pi_\theta$, implemented as an autoregressive language model for program generation and parameterized by $\theta$, by minimizing the reward-weighted loss over the dataset $\mathcal{D}$, where each example consists of an image $x$, user query $q$, program $z$ generated by the previous iteration's policy, and ground truth answer $y$:
$$\mathcal{L}(\theta) = \mathbb{E}_{(x, q, z, y) \sim \mathcal{D}} \big[ R \cdot \ell(z, q; \theta) \big], \qquad \ell(z, q; \theta) = -\tfrac{1}{L} \sum_{l=1}^{L} \log \pi_\theta\big(z_l \mid z_{<l}, q\big) \quad (8)$$
where $\ell$ is the negative log-likelihood loss on next-token prediction and $L$ is the sequence length.
Khan et al. [23] introduce a correctness reward based on performance on the training set:
$$R_{\text{corr}}(z, x, y) = \mathbb{1}\big[\phi(z, x) = y\big] \quad (9)$$
However, this approach can lead to sparse rewards and may falsely reward programs that are right for incorrect reasons. Khan et al. [23] address this issue through human corrections to stabilize training. Instead, we reformulate the reward using feedback from the visual unit tests:
$$R_{\text{ViUniT}}(z) = \mathbb{1}\big[S(z) \geq \eta\big] \quad (10)$$
where $\eta$ is a passing threshold. We terminate policy iteration on declining reward. Following earlier work [22], we assume that an optimal policy will keep increasing an optimal reward function $R^*$. Thus, when our proxy reward declines (i.e., regret increases), there are theoretical guarantees that we are not far from the optimal policy that can be learned under $R^*$.
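A minimal sketch of the resulting update for a HuggingFace-style causal LM is given below. The binary reward form, the threshold name `eta`, and the omission of query-token masking in the loss are simplifying assumptions.

```python
# Sketch of the reward-weighted policy update (Eqs. 8 and 10).
# `model` and `tokenizer` are a HuggingFace-style causal LM (e.g. a CodeLlama
# checkpoint with LoRA adapters); `unit_test_score` is S(z) from Section 3.3.
# For brevity the NLL is computed over the full (query + program) sequence
# rather than masking the query tokens.
def reward_weighted_loss(model, tokenizer, query, program, unit_test_score, eta=0.8):
    reward = 1.0 if unit_test_score >= eta else 0.0          # Eq. 10 (unsupervised)
    enc = tokenizer(query + program, return_tensors="pt").to(model.device)
    nll = model(**enc, labels=enc["input_ids"]).loss          # mean next-token NLL
    return reward * nll                                       # Eq. 8 (per example)

# Training step (programs z are sampled from the previous iteration's policy):
#   loss = reward_weighted_loss(model, tok, q, z, S_z); loss.backward(); optimizer.step()
```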
4 Experimental Setup
Below is the experimental setup: datasets (Section 4.1), baselines (Section 4.2), and implementation details (Section 4.3).
4.1 Data
We utilize three compositional reasoning datasets: GQA [20] for Visual Question Answering (VQA), and SugarCREPE [17] and Winoground [52] for Image-Text Matching (ITM), assessing model performance via accuracy metrics. For GQA, we calculate accuracy using an implementation by Surís et al. [49], which standardizes and compares generated answers for exact matches (https://github.com/cvlab-columbia/viper/blob/main/datasets/gqa.py). Our experimental setup incorporates training and testing splits sampled similarly to Khan et al. [23], specifically testing on 502 examples from the GQA balanced-val split and training on 1022 examples from the balanced-train split, with 10 samples per question group. In SugarCREPE, we utilize 788 examples for training by subsampling approximately 10% of the dataset balanced across question types, excluding our validation split. The validation subset consists of 560 examples and includes both positive and negative image-text pairings from 40 samples from each of the 7 question types. The full Winoground dataset is used, encompassing all possible positive and negative pairings for a total of 1600 test examples, with the SugarCREPE dataset employed for training purposes. Refer to the supplement for further dataset details.


4.2 Baselines
We evaluate against the following baselines:
Base Setup: Following the prototypical use of visual programs [49, 14], we prompt the LLM to generate a single program per query, which is executed to retrieve a response.
Most Common Answer: To leverage multiple programs, we compare performance with selecting the most common answer across executed programs if one exists.
Error Re-prompting: To evaluate the effectiveness of unit-test incorporation in program correction via unit-test re-prompting, we benchmark performance against a method that leverages error-traces as feedback in Equation 7. Further details are provided in the supplement.
Correctness Reward: We benchmark our unsupervised unit-test RL reward formulation against the supervised correctness reward described by Equation 9.
4.3 Implementation Details
We provide a summary of key implementation details, with additional information in the supplement. Experiments were conducted on two A100 40GB GPUs, though a single GPU suffices for smaller API models. Results report the mean and standard deviation across 3 runs.
Program Generation Models: Three program generator models are employed: codellama/CodeLlama-7b-Python-hf [41] and google/codegemma-7b-it [51], hosted on HuggingFace and served by VLLM [26], as well as gpt-4o-mini [1] served by OpenAI. We use HuggingFace's SFT-Trainer to train the RL policy using LoRA [18] with the reward defined in Equation 10. Models are prompted with an API adapted from ViperGPT [49] and 4 in-context examples.
API Models: Object detection is performed using IDEA-Research/grounding-dino-base [32]. For image-text matching, we use openai/clip-vit-large-patch14-336 [39], and for visual question answering, we employ Salesforce/blip2-flan-t5-xxl [28]. All models are accessed through HuggingFace.
Unit Test Generation Models: We use meta-llama/Meta-Llama-3-8B-Instruct [8] to generate image descriptions and expected answers for unit test candidates. The unit test sampler is implemented with sentence-transformers, using the all-MiniLM-L6-v2 [56] model to embed image descriptions. For image generation, we use the diffusers library, specifically CompVis/stable-diffusion-v1-4 for SDv1.4, longlian/lmd_plus for LM Guided Diffusion, and stabilityai/stable-diffusion-xl-base-1.0 for SDXL3.
Program Scoring and Execution: Program executions are capped at 120 seconds. Unit test scoring error penalties are applied as defined in Equation 5. Unless specified otherwise, no end-to-end model fallback was employed on exception.
5 Strategies for Visual Unit Test Generation
We explore different unit test generation configurations applied to best program selection, using a smaller dataset of three questions from each group in GQA and each tag in WinoGround, yielding 303 and 504 samples, respectively.

Number of unit tests $M$. Figure 5 illustrates that increasing both the number of unit tests and the number of candidate programs improves accuracy on both datasets. Accuracy rises substantially with the addition of unit tests, particularly from 1 to 5 tests, after which gains diminish. Higher numbers of programs (e.g., 4 or 5) consistently yield better accuracy compared to fewer programs, underscoring the benefit of exploring multiple candidate solutions.

Unit Test Generator $\mathcal{G}$. Figure 6 demonstrates that in low unit test settings, incorporating program information into unit test generation yields comparable results to query-only approaches. However, as the number of unit tests and programs increases, disregarding implementation details proves significantly more effective. This aligns with software engineering best practices, where unit tests are designed to remain independent of specific implementations.

Unit Test Sampler. Figure 7 demonstrates the impact of different unit test sampling methods on model accuracy. In GQA, “Coverage By Answer then Input” shows increasing performance as the number of unit tests grows, thus allowing the saturation of possible answers. Figure 4(a) highlights limitations of the other methods: “Coverage by Input” may suffer from reduced answer diversity, and “Coverage by Answer” could involve repetitive inputs. In WinoGround there is negligible difference across methods, due to its restriction to two answers, preventing significant sampling diversity. Nevertheless, an analysis of performance by question type in the supplement shows that this sampling method yields higher results for attribute-related queries in both datasets.

Image Generator $\mathcal{M}$. Figure 8 illustrates the impact of different diffusion models. In GQA at lower unit test settings, LM Guided diffusion yields some accuracy improvements, while for WinoGround, LM Guided diffusion only helps in lower program settings, with quick convergence as the number of programs increases. The benefit of LM Guided diffusion is primarily driven by improved tests when spatial positioning is critical, as shown by the result breakdowns in the supplement and illustrated in Figure 4(b).

Scoring function $s$. The supplement presents results with varying error penalties, illustrating that in few unit test settings, imposing error penalties enhances the likelihood of selecting a successful program.
Table 1: Best program selection results (accuracy %, mean±std over 3 runs).

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|
| Base Setup | | | | | | |
| gpt-4o-mini | 1 | 0 | 42.03±1.21 | 44.98±0.75 | 38.75±0.47 | 41.92±0.81 |
| CodeLlama-7B | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
| CodeGemma-7B | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
| Most Common Answer Setup | | | | | | |
| CodeLlama-7B | 5 | 0 | 42.50±1.50 | 45.85±0.77 | 41.67±1.79 | 43.34±1.35 |
| CodeGemma-7B | 5 | 0 | 43.89±0.98 | 46.04±1.48 | 46.67±1.69 | 45.53±1.38 |
| ViUniT Setup (Ours) | | | | | | |
| CodeLlama-7B | 5 | 5 | 49.27±1.33 | 49.73±0.73 | 47.02±1.19 | 48.67±1.08 |
| CodeGemma-7B | 5 | 5 | 48.01±1.05 | 51.92±0.90 | 51.85±2.16 | 50.59±1.37 |
Table 2: Answer refusal results (accuracy %, mean±std over 3 runs).

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|
| Reverting on Error | | | | | | |
| CodeLlama-7B | 1 | 0 | 44.89±2.04 | 51.67±1.16 | 49.29±0.99 | 48.61±1.40 |
| CodeGemma-7B | 1 | 0 | 44.89±2.19 | 47.25±2.17 | 49.58±0.88 | 47.24±1.74 |
| Reverting on ViUniT Threshold (Ours) | | | | | | |
| CodeLlama-7B | 1 | 5 | 54.18±0.40 | 50.67±1.28 | 49.05±0.82 | 51.30±0.84 |
| CodeGemma-7B | 1 | 5 | 54.58±1.24 | 50.73±0.94 | 50.12±1.62 | 51.81±1.27 |
Table 3: Re-prompting results (accuracy %, mean±std over 3 runs).

| LLM | Iter. | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|---|
| Error Reprompting | | | | | | | |
| CodeLlama-7B | 1 | 1 | 0 | 37.92±2.68 | 42.46±0.57 | 33.21±0.64 | 37.86±1.30 |
| CodeGemma-7B | 1 | 1 | 0 | 42.63±2.42 | 42.42±1.91 | 44.52±1.05 | 42.63±2.42 |
| ViUniT Reprompting (Ours) | | | | | | | |
| CodeLlama-7B | 1 | 1 | 5 | 46.68±2.52 | 51.85±0.40 | 47.68±2.17 | 48.74±1.69 |
| CodeGemma-7B | 1 | 1 | 5 | 45.75±0.30 | 48.19±2.28 | 48.21±1.12 | 47.38±1.23 |
Table 4: Reinforcement learning reward design results (accuracy %, mean±std over 3 runs).

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
|---|---|---|---|---|---|---|
| Supervised Correctness Reward | | | | | | |
| CodeLlama-7B | 1 | 0 | 39.18±4.88 | 48.65±0.87 | 39.58±2.75 | 42.47±2.83 |
| CodeGemma-7B | 1 | 0 | 43.03±5.08 | 45.98±2.64 | 46.31±2.26 | 45.11±3.33 |
| Unsupervised ViUniT Reward (Ours) | | | | | | |
| CodeLlama-7B | 1 | 0 | 40.57±2.10 | 46.52±0.81 | 41.85±1.44 | 42.98±1.45 |
| CodeGemma-7B | 1 | 0 | 45.68±2.45 | 49.29±0.43 | 46.55±0.69 | 47.17±1.19 |
6 Strategies of Visual Unit Test Utilization
Best Program Selection:
Table 1 underscores the efficacy of ViUniT-based selection in identifying the best program. Our approach demonstrates a notable average improvement of 11.4 accuracy points over the base setup and a substantial 7.7-point average gain over the gpt-4o-mini configuration. Furthermore, it surpasses most common answer selection by an average margin of 5.2 points.
Answer Refusal: Figure 9 illustrates the impact of varying the threshold on the F1 score of refusing programs with incorrect answers (left) and the false pass failure rate (right), measured relative to the total number of programs. The minimal false pass failure rate at higher thresholds supports the use of unit test scores as a proxy for correctness during unsupervised model fine-tuning. Table 2 shows an improvement of 3.6 points from reverting based on the unit-test threshold compared to reverting only on error. For CodeLlama-7B, performance on image-text matching is similar between the two methods, as some programs yield correct answers despite failing unit tests. Although such programs impact final performance, a human inspection of 40 samples revealed that 65% were unreliable from the start.
Re-prompting: Table 3 demonstrates that re-prompting with ViUniT achieves an average improvement of 7.5 points over error-based re-prompting, with a notable 10.9-point increase for CodeLlama-7B, which performs lower in the base setting. The unit tests offer additional opportunities for refining the method's initial response, as they go beyond error detection to assess program confidence, while also providing a measure of comparison between the programs.
RL Reward Design: The pattern of improvements is particularly interesting in the RL setting, where we find that ViUniT rewards outperform correctness rewards by an average of 1.3 points in accuracy despite not relying on the training labels. Additionally, we observe a notable reduction in the percentage of code leading to exceptions; errors decrease from 14.47% to 11.76% for CodeLlama and even more sharply from 11.73% to 4.68% for CodeGemma. These results indicate that heavily rewarding higher-quality code, as filtered through unit tests, encourages the development of a more robust and error-resistant policy.
7 Human Evaluation
We summarize key findings from two human evaluations that assess unit test quality and improvements in program reliability. Full details are available in the supplement.
Unit Test Evaluation: We randomly sampled 20 examples from each of the three datasets, each corresponding to 5 unit tests, resulting in a total of 300 unit tests, each of which was judged by three annotators. Based on the majority annotator response, 75% of unit tests per sample were correct.
Annotators could optionally comment on errors, with “Missing Object” noted as the most frequent issue.
Program Evaluation: To measure the effectiveness of unit tests in enhancing program reliability, we evaluated 100 VQA programs that correctly answered the queries, drawn from both the base and the unit-test best program selection setups. Two annotators with 3+ years of Python experience graded programs from 0 (Fully Correct) to 3 (Irrelevant).
Under the unit test setup, 86% of programs were fully correct, compared to 77% in the base setup. Additionally, only 5% of programs were marked completely incorrect—with none deemed irrelevant—compared to 14% and 4%, respectively, in the base setup. Notably, the most common error type shifted from “Incorrect Logic” in the base setup to “Missing Checks (e.g., list index out of range)” in the unit-test setup.
8 Conclusion and Future Work
We introduce ViUniT, the first framework to automatically generate unit tests for verifying visual program correctness, addressing cases where programs may appear correct for the wrong reasons. Unit tests are leveraged in four ways: best program selection (+11.4 points over the base setup and +7.7 points over gpt-4o-mini), answer refusal, re-prompting, and unsupervised RL reward design (+1.3 points over supervised rewards). Future directions include fine-grained test generation and broader task applications. By reinforcing logical correctness, ViUniT advances robustness and interpretability in visual programs.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Alagarsamy et al. [2024] Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3test: Assertion-augmented automated test case generation. Information and Software Technology, 176:107565, 2024.
- Anil et al. [2023] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. [2023] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023.
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Dou et al. [2024] Shihan Dou, Yan Liu, Haoxiang Jia, Enyu Zhou, Limao Xiong, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. StepCoder: Improving code generation with reinforcement learning from compiler feedback. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4571–4585, Bangkok, Thailand, 2024. Association for Computational Linguistics.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gao et al. [2024] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. Clova: A closed-loop visual assistant with tool usage and update. Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Ge et al. [2025] Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, and Trevor Darrell. Recursive visual programming. In European Conference on Computer Vision, pages 1–18. Springer, 2025.
- Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
- Guilherme and Vincenzi [2023] Vitor Guilherme and Auri Vincenzi. An initial investigation of chatgpt unit test generation capability. In Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, pages 15–24, 2023.
- Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
- Gupta and Kembhavi [2023] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
- [15] Cheng Han, James Chenhao Liang, Qifan Wang, MAJID RABBANI, Sohail Dianat, Raghuveer Rao, Ying Nian Wu, and Dongfang Liu. Image translation as diffusion visual programmers. In The Twelfth International Conference on Learning Representations.
- Hong et al. [2024] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
- Hsieh et al. [2024] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in neural information processing systems, 36, 2024.
- Hu et al. [2022] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- Hu et al. [2024] Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9590–9601, 2024.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
- Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- [22] Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart’s law in reinforcement learning. In The Twelfth International Conference on Learning Representations.
- Khan et al. [2024] Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, and Manmohan Chandraker. Self-training large language models for improved visual program synthesis with visual reinforcement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14344–14353, 2024.
- Khorikov [2020] Vladimir Khorikov. Unit Testing Principles, Practices, and Patterns. Simon and Schuster, 2020.
- Koo et al. [2024] Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, and Vicente Ordonez. PropTest: Automatic property testing for improved visual programming. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8241–8256, Miami, Florida, USA, 2024. Association for Computational Linguistics.
- Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- Le et al. [2022] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
- Li et al. [2024] Zhuowan Li, Bhavan Jasani, Peng Tang, and Shabnam Ghadar. Synthesize step-by-step: Tools templates and llms as data generators for reasoning-based chart vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13613–13623, 2024.
- Lian et al. [2024] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. Transactions on Machine Learning Research, 2024. Featured Certification.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024a.
- Liu et al. [2024b] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer, 2024b.
- Longpre et al. [2023] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pages 22631–22648. PMLR, 2023.
- Lu et al. [2024] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.
- Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
- Nijkamp et al. [2023] Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, et al. Xgen-7b technical report. arXiv preprint arXiv:2309.03450, 2023.
- Panagopoulou et al. [2024] Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
- Podell et al. [2024] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
- Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Schick et al. [2024] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.
- Selvaraju et al. [2020] Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020.
- Shen et al. [2023] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, et al. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936, 2023.
- [45] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. Execution-based code generation using deep reinforcement learning. Transactions on Machine Learning Research.
- Siddiq et al. [2023] Mohammed Latif Siddiq, Joanna Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, FA Rifat, and V Carvalho Lopes. Exploring the effectiveness of large language models in generating unit tests. arXiv preprint arXiv:2305.00418, 2023.
- Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
- [48] Aleksandar Stanić, Sergi Caelles, and Michael Tschannen. Towards truly zero-shot compositional visual reasoning with llms as programmers. Transactions on Machine Learning Research.
- Surís et al. [2023] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
- Takerngsaksiri et al. [2024] Wannita Takerngsaksiri, Rujikorn Charakorn, Chakkrit Tantithamthavorn, and Yuan-Fang Li. Tdd without tears: Towards test case generation from requirements through deep reinforcement learning. arXiv preprint arXiv:2401.07576, 2024.
- Team [2024] CodeGemma Team. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409, 2024.
- Thrush et al. [2022] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238. IEEE Computer Society, 2022.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Ukai et al. [2024] Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, and Nakamasa Inoue. Adacoder: Adaptive prompt compression for programmatic visual question answering. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9234–9243, 2024.
- Wang et al. [2020] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
- Wei et al. [2024] Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative llm-agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15077–15087, 2024.
- Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI conference on artificial intelligence, pages 3081–3089, 2022.
- Yang et al. [2023] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
Appendix A Data
The three compositional reasoning datasets used in this work are GQA [20], SugarCREPE [17], and WinoGround [52]. Table 5 shows examples from each dataset, and Table 6 summarizes the dataset statistics. For GQA evaluation, we sample 5 questions from each of the 102 question groups from the balanced-val split, for a total of 502 examples. For training, we sample 10 questions per group from the balanced-train split, yielding 1022 examples. Note that some groups such as typeVerifyC, stateChoose, and companyVerify do not have a sufficient number of questions, so we sample the whole group. For SugarCREPE, we utilize 788 examples for training by subsampling 10% of the dataset balanced across the 7 question types, excluding our validation split. This validation subset consists of 560 examples and includes both positive and negative image-text pairings from 40 samples from each of the 7 question types. The full Winoground dataset is used, encompassing all possible positive and negative pairings for a total of 1600 examples, with the SugarCREPE dataset employed for training.
Table 5: Example questions and answers from each dataset (images omitted).

| Dataset | Question | Answer |
|---|---|---|
| GQA | Are there any guys to the right of the brown horse? | no |
| GQA | Which direction is the animal that looks white and brown looking at? | forward |
| GQA | What type of animal is that fence behind of, an elephant or a giraffe? | giraffe |
| SugarCREPE | Is there a white pitcher holding flowers in a window sill? | yes |
| SugarCREPE | Are a cat and a dog napping together under a blanket on the couch? | no |
| SugarCREPE | Is a dog sitting in front of a laptop on top of a bed? | yes |
| WinoGround | Verify image matches text=“two humans and one wheel” | yes |
| WinoGround | Verify image matches text=“red building with white shutters” | no |
| WinoGround | Verify image matches text=“the person with the white collared shirt waters the plant while the other holds it” | yes |
Table 6: Dataset statistics, reported as train/evaluation.

| Dataset | # Samples | # Images | # Questions | # Answers | # Question Types | # Questions/Type |
|---|---|---|---|---|---|---|
| GQA | 1022/502 | 1014/487 | 937/474 | 176/122 | 105/102 | 10/5 |
| WinoGround | -/1600 | -/800 | -/800 | -/2 | -/70 | -/8 |
| SugarCREPE | 788/560 | 335/260 | 765/557 | 2/2 | 7/7 | 52/80 |
Appendix B Unit Test Sampling Pseudocode
For clarity, Algorithm 1 presents the pseudocode for the unit test coverage sampling method described in Section 3.
Appendix C Program Generation and Execution
In this section, we outline the implementation details for program generation and execution.
C.1 Generation Details
For program generation we use in-context examples both for off-the-shelf inference and for fine-tuned model inference. Generation is conducted using VLLM with the following generation parameters: temperature=1.0, top_p=0.9, top_k=0.0, max_new_tokens=320, and num_beams=1. We set the temperature to a high value to ensure diversity in generated programs. For CodeLlama we prefix the prompt with <s>, and for CodeGemma we enclose it in <bos><start_of_turn>[..]<end_of_turn>.
C.2 Image Patch API
We present the ImagePatch API in Code LABEL:code:api_prompt, which we adapt from Khan et al. [23], which is in turn adapted from ViperGPT [49]. We implement object detection using IDEA-Research/grounding-dino-base [32] with text_threshold=box_threshold=0.2, image-text matching using openai/clip-vit-large-patch14-336 [39] with a 0.8 similarity threshold for detection, and the underlying visual question answering module is Salesforce/blip2-flan-t5-xxl [28] loaded in 8-bit using BitsAndBytes with a maximum batch size of 4 and generation hyperparameters length_penalty=-1, num_beams=5, max_length=10, min_length=1, do_sample=False, top_p=0.9, repetition_penalty=1.0, and temperature=1 for QA; we set length_penalty=1 and max_length=30 for captioning. All models are served by HuggingFace.
C.3 In-Context Examples
We present the in-context examples used for visual question answering and image-text matching in Codes LABEL:code:vqa_ice and LABEL:code:itm_ice respectively. Code execution is handled using multiprocessing with a batch size of 30, and a timeout of 120 seconds, after which a TimeOutException is raised if execution exceeds the limit.
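A sketch of the timed execution wrapper is shown below. The convention that generated programs define an `execute_command(image)` entry point follows ViperGPT-style prompts, and the exception class, worker structure, and queue-based result passing are illustrative assumptions.

```python
# Sketch of program execution with a 120-second timeout (Appendix C.3).
# Generated programs are assumed to define `execute_command(image)`;
# the structure below is illustrative rather than the exact implementation.
import multiprocessing as mp

class TimeOutException(Exception):
    pass

def _worker(program_src, image, api_globals, queue):
    scope = dict(api_globals)              # expose the ImagePatch API to the program
    exec(program_src, scope)               # defines execute_command(image)
    queue.put(scope["execute_command"](image))

def run_with_timeout(program_src, image, api_globals=None, timeout=120):
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(program_src, image, api_globals or {}, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        raise TimeOutException(f"Execution exceeded {timeout} seconds")
    if queue.empty():                      # the program itself raised an exception
        raise RuntimeError("Program execution failed")
    return queue.get()
```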
Appendix D Unit Test Generation
D.1 Implementation Details
To generate the unit test image descriptions and expected answers, we prompt meta-llama/Meta-Llama-3-8B-Instruct, executed via VLLM with the following generation parameters: temperature=0.7, top_p=0.9, top_k=0.0, max_new_tokens=512, and num_beams=1. We return 3 output sequences, from which we extract the unit tests, deduplicate them, and filter out answers longer than five words (since they are out of distribution for the task) before feeding them to the sampling module.
D.2 In-Context Examples
We prompt the LLM with the system prompt presented below, as well as in-context examples presented in Codes LABEL:code:ut_gen_ice_vqa and LABEL:code:ut_gen_ice_itm for VQA and ITM respectively.
You are a skilled AI assistant specialized in generating test cases for programs that respond to queries about images.
D.3 Unit Test Candidate Generation
We experiment with two prompting methodologies for unit test generation: Query-Only and Query+Implementation. The former only takes into account the user query to generate the unit tests, while the latter also takes into account each generated program. We prompt the Visual Program Generator in the same way, but additionally include implementation examples and the current implementation, as shown in Code LABEL:code:vqa_ice_implementation.
D.4 Image Generation
To generate the images we use the diffusers library, prompting each of the models with generation hyperparameters guidance_scale=16.0 and num_inference_steps=50. In the case of NSFW image generation, we update the seed by 1 and regenerate the image, up to 10 times. Effectively, all unit tests have a corresponding image. We use the following implementations: CompVis/stable-diffusion-v1-4 for SDv1.4, longlian/lmd_plus for LM Guided Diffusion, and stabilityai/stable-diffusion-xl-base-1.0 for SDXL3.
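The following sketch shows how such generation might look with the diffusers library for the SDv1.4 checkpoint, using the hyperparameters listed above; the seed-retry loop for NSFW-filtered outputs mirrors the description, but its exact structure is an assumption.

```python
# Sketch of unit-test image generation with diffusers (SDv1.4 checkpoint),
# using guidance_scale=16.0 and num_inference_steps=50 as listed above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def generate_test_image(caption, seed=0, max_retries=10):
    out = None
    for attempt in range(max_retries):
        generator = torch.Generator("cuda").manual_seed(seed + attempt)
        out = pipe(caption, guidance_scale=16.0, num_inference_steps=50,
                   generator=generator)
        if not out.nsfw_content_detected[0]:   # retry with a new seed if filtered
            return out.images[0]
    return out.images[0]                       # fall back to the last attempt
```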
D.4.1 LM Grounded Diffusion
To generate the bounding boxes and phrases for LM Grounded Diffusion we prompt meta-llama/Meta-Llama-3-8B-Instruct, executed via VLLM with the following generation parameters: temperature=1.0, top_p=0.9, top_k=0.0, max_new_tokens=320, and num_beams=1. We return 5 candidate sequences to collect multiple candidates since we notice that often the extracted phrases can be empty, leading to failure in image generation. We present the prompt and in-context examples used for this part in Code LABEL:code:lm_grounded.
Appendix E Strategies for Visual Unit Test Generation
E.1 Unit Test Sampler
Figure 10 illustrates the impact of different sampling strategies with varying the number of unit tests and program configurations. Our results indicate that ‘Coverage by Answer then Input’, consistently outperforms other methods. To gain deeper insights, we categorize the questions into three groups: Spatial, Attribute, and Other. For GQA, we classify any question groups containing Attr as Attribute and those mentioning location or position as Spatial. Figure 11 presents the average performance across scenarios with at least five unit tests and three program configurations. Notably, the Coverage by Answer Then Input strategy emerges as the most effective for questions in the Attribute category.


E.2 Image Generator
Figure 12 shows the impact of various diffusion models across different numbers of unit tests and program configurations. Our analysis reveals that LM-Guided diffusion consistently outperforms other methods, particularly in scenarios with more programs, where the likelihood of finding a suitable program for execution is higher. To provide a deeper understanding, Figure 13 illustrates the average performance across scenarios involving at least three unit tests and two program configurations, focusing on the categories defined in the previous subsection. Notably, LM-Guided diffusion proves most effective for questions in the Spatial category, highlighting the advantages of more controllable generation in achieving higher spatial fidelity.


E.3 Scoring function
Figure 14 highlights the impact of error penalties across varying configurations of unit tests and programs. While their effect becomes negligible in higher-resource configurations with more programs and unit tests, error penalties prove beneficial in lower-resource settings. In these scenarios, they help prioritize the selection of executable programs, thereby improving performance. Notably, runtime error penalties are more impactful for GQA, whereas compilation error penalties play a larger role in WinoGround. This difference likely stems from the higher complexity of WinoGround programs, which are more prone to compilation errors.

E.4 Aggregate Scorer
Figure 15 illustrates the impact of various aggregator functions on accuracy. Among these, mean score aggregation consistently outperforms other methods, particularly in configurations with a higher number of programs. In the case of WinoGround, however, max aggregation also performs competitively, occasionally surpassing mean aggregation. This is likely due to the binary nature of the answers in WinoGround and the increased likelihood of selecting programs that are correct for incorrect reasons.

Appendix F Visual Unit Test Utilization Methods
F.1 Best Program Selection
Table 7 shows additional results on best program selection with varying numbers of programs.
VQA | Image-Text Matching | |||||
LLM | # Prog | # UT | GQA | Winoground | SugarCREPE | Avg. |
Base Setup | ||||||
gpt-4o-mini | 1 | 0 | 42.03±1.21 | 44.98±0.75 | 38.75±0.47 | 41.92±0.81 |
CodeLlama-7B | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
CodeGemma-7B | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
Most Common Answer Setup | ||||||
CodeLlama-7B | 2 | 0 | 27.76±0.41 | 36.19±0.66 | 32.02±2.25 | 31.99±1.11 |
CodeLlama-7B | 3 | 0 | 35.99±0.70 | 42.40±0.85 | 37.26±2.70 | 38.55±1.42 |
CodeLlama-7B | 4 | 0 | 38.71±1.61 | 42.12±0.60 | 39.17±2.01 | 40.00±1.41 |
CodeLlama-7B | 5 | 0 | 42.50±1.50 | 45.85±0.77 | 41.67±1.79 | 43.34±1.35 |
CodeGemma-7B | 2 | 0 | 31.87±0.80 | 33.04±0.67 | 36.37±1.62 | 33.76±1.03 |
CodeGemma-7B | 3 | 0 | 40.31±1.00 | 40.50±1.33 | 44.58±0.55 | 41.80±0.96 |
CodeGemma-7B | 4 | 0 | 40.44±0.53 | 43.06±1.89 | 44.46±1.17 | 42.66±1.20 |
CodeGemma-7B | 5 | 0 | 43.89±0.98 | 46.04±1.48 | 46.67±1.69 | 45.53±1.38 |
ViUniT Setup (Ours) | ||||||
CodeLlama-7B | 2 | 5 | 41.90±1.74 | 46.65±1.63 | 40.24±0.82 | 42.93±1.40 |
CodeLlama-7B | 3 | 5 | 45.68±0.94 | 48.54±0.37 | 43.93±1.09 | 46.05±0.80 |
CodeLlama-7B | 4 | 5 | 49.07±2.39 | 50.17±0.54 | 45.65±1.22 | 48.30±1.38 |
CodeLlama-7B | 5 | 5 | 49.27±1.13 | 49.73±0.73 | 47.02±1.19 | 48.67±1.02 |
CodeGemma-7B | 2 | 5 | 44.02±0.72 | 49.27±0.57 | 46.73±2.30 | 46.67±1.20 |
CodeGemma-7B | 3 | 5 | 46.08±0.41 | 51.17±1.98 | 48.93±1.86 | 48.73±1.42 |
CodeGemma-7B | 4 | 5 | 47.88±1.36 | 52.25±1.35 | 50.83±1.32 | 50.32±1.34 |
CodeGemma-7B | 5 | 5 | 48.01±1.05 | 51.92±0.90 | 51.85±2.16 | 50.59±1.37 |
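Combining the scoring and aggregation sketches above, best program selection reduces to an argmax over candidates. The helper below is a hypothetical illustration of that step, with the scorer passed in as a parameter.

```python
# Hypothetical selection step: score each candidate program on the same unit
# tests (e.g. with a scorer like score_program sketched in Appendix E.3) and
# keep the argmax. Tie-breaking and refusal thresholds are handled elsewhere.
def select_best_program(candidate_programs, unit_tests, scorer):
    scored = [(scorer(program, unit_tests), program) for program in candidate_programs]
    best_score, best_program = max(scored, key=lambda pair: pair[0])
    return best_program, best_score
```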
F.2 Answer Refusal
Figure 16 shows additional statistics on answer refusal, in particular the accuracy of selecting the programs that provide the final answer and of the programs that pass the unit tests at different thresholds.
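A hypothetical refusal wrapper illustrates the thresholding: if even the best candidate scores below a cutoff on the unit tests, the system declines to answer (or defers to the end-to-end fallback of Appendix G). The threshold value shown is illustrative; Figure 16 sweeps it.

```python
# Hypothetical answer-refusal wrapper: if the best program's unit-test score is
# below a threshold, refuse to answer (or hand off to a fallback model).
REFUSAL_THRESHOLD = 0.8  # illustrative value


def answer_or_refuse(best_program, best_score, image, run_program):
    if best_score < REFUSAL_THRESHOLD:
        return None  # refusal: caller may invoke an end-to-end fallback (Appendix G)
    return run_program(best_program, image)
```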

F.3 Re-prompting
F.3.1 Implementation Details
We consider an application of unit tests in which different candidate programs are generated whenever the initially generated program scores below a threshold. To do so, we keep the same hyperparameters for the program generator but adapt the prompt to include the unit-test outputs and use suitable in-context examples, as shown in Codes LABEL:code:viunit_reprompting_vqa and LABEL:code:viunit_reprompting_itm for VQA and ITM, respectively.
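A minimal sketch of this loop is shown below, assuming hypothetical run_tests and generate_program helpers; the prompt format is illustrative and not the exact template referenced above.

```python
# Hypothetical re-prompting loop: when the program scores below a threshold on
# the unit tests, its failures are serialized into the prompt and the code LLM
# is asked to rewrite the program. Helper names and prompt format are assumed.
def viunit_reprompt(query, program, unit_tests, generate_program, run_tests,
                    threshold=0.8, max_iters=2):
    for _ in range(max_iters):
        results = run_tests(program, unit_tests)  # one dict per test: passed/description/expected/output
        score = sum(r["passed"] for r in results) / len(results)
        if score >= threshold:
            break
        feedback = "\n".join(
            f"- image: {r['description']} | expected: {r['expected']} | got: {r['output']}"
            for r in results if not r["passed"])
        prompt = (f"Query: {query}\n"
                  f"Previous program:\n{program}\n"
                  f"Failed unit tests:\n{feedback}\n"
                  f"Rewrite the program so that it passes the unit tests.")
        program = generate_program(prompt)
    return program
```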
Error Reprompting Baseline
We employ the same model and hyperparameters as ViUniT reprompting, but instead adapt the prompt to take into account the error messages rather than the unit tests, as shown in Codes LABEL:code:error_reprompting_vqa and LABEL:code:error_reprompting_itm for VQA and ITM, respectively.
F.3.2 Additional Results
Table 8 presents the results of an additional reprompting iteration, highlighting that while ViUniT reprompting continues to achieve higher overall performance, there is a slight drop in accuracy compared to the previous iteration. This decline can be attributed to its attempts to refine programs that may already produce correct answers for the wrong reasons. Such corrections can inadvertently shift the generated answers, leading to decreased accuracy despite the method’s focus on improving program fidelity.
LLM | Iter. | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg.
---|---|---|---|---|---|---|---
Base Setup (Iteration = 0) | |||||||
CodeLlama-7B | 0 | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
CodeGemma-7B | 0 | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
Error Reprompting | |||||||
CodeLlama-7B | 1 | 1 | 0 | 37.92±2.68 | 42.46±0.57 | 33.21±0.64 | 37.86±1.30 |
CodeLlama-7B | 2 | 1 | 0 | 38.78±2.22 | 44.58±0.44 | 37.08±1.08 | 40.15±1.25 |
CodeGemma-7B | 1 | 1 | 0 | 42.63±2.42 | 42.42±1.91 | 44.52±1.05 | 42.63±2.42 |
CodeGemma-7B | 2 | 1 | 0 | 42.90±2.65 | 43.08±1.73 | 45.30±0.92 | 42.90±2.65 |
ViUniT Reprompting (Ours) | |||||||
CodeLlama-7B | 1 | 1 | 5 | 46.68±2.52 | 51.85±0.40 | 47.68±2.17 | 48.74±1.69 |
CodeLlama-7B | 2 | 1 | 5 | 46.95±1.33 | 52.04±0.83 | 48.04±1.64 | 49.01±1.26 |
CodeGemma-7B | 1 | 1 | 5 | 45.75±0.30 | 48.19±2.28 | 48.21±1.12 | 47.38±1.23 |
CodeGemma-7B | 2 | 1 | 5 | 44.42±1.00 | 49.25±2.66 | 48.81±1.19 | 47.49±1.62 |
F.4 Reward Design for Reinforcement Learning
F.4.1 Implementation Details
Table 9 contains additional hyperparameters used for training; a configuration sketch expressing these values in code follows the table. Each RL epoch takes about 30 minutes with the correctness reward and about 90 minutes with the ViUniT reward, since the latter requires executing unit tests.
Parameter | Value
---|---
warmup_ratio | 0.1 | ||||
max_grad_norm | 0.3 | ||||
lr_scheduler_type | linear | ||||
learning_rate | 2e-4 | ||||
lora_config.r | 16 | ||||
lora_config.lora_alpha | 32 | ||||
lora_config.lora_dropout | 0.05 | ||||
lora_config.bias | none | ||||
lora_config.target_modules |
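Expressed in code, the Table 9 values map onto a PEFT LoRA configuration and standard trainer arguments as in the sketch below. The target_modules entry did not survive extraction, so the module names are an assumption, as is the output path.

```python
# A sketch of the Table 9 hyperparameters as a PEFT LoRA config plus trainer
# arguments. target_modules and output_dir are assumptions, not the paper's
# setting; all other values are taken directly from the table.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # assumed; not recoverable from the table
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="viunit_rl",          # placeholder path
    learning_rate=2e-4,
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    lr_scheduler_type="linear",
)
```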
F.4.2 Additional Analysis
Table 10 highlights the reduced error rates (measured as the number of programs leading to exceptions) achieved using the ViUniT reward. Additionally, Table 11 presents the results of cross-task and cross-dataset generalization for policies trained on GQA, following the approach of [23]. For VQAv2 [11], we sample 10 questions for each of the 50 most common answers from the validation split of the compositional subset curated by [43], similar to [23]. For OKVQA [35], we sample 10 questions per question type, for a total of 110 questions. The results indicate that while both reward types generalize well across tasks and datasets, the ViUniT reward consistently delivers superior performance.
LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg.
---|---|---|---|---|---|---
Supervised Correctness Reward | ||||||
CodeLlama-7B | 1 | 0 | 15.14±7.74 | 8.21±1.72 | 20.06±3.62 | 14.47±4.36 |
CodeGemma-7B | 1 | 0 | 9.10±9.35 | 13.25±6.30 | 12.86±4.41 | 11.73±6.69 |
Unsupervised ViUniT Reward (Ours) | ||||||
CodeLlama-7B | 1 | 0 | 9.56±2.13 | 10.31±1.55 | 15.42±3.03 | 11.76±2.24 |
CodeGemma-7B | 1 | 0 | 1.99±0.91 | 5.81±0.49 | 6.25±1.02 | 4.68±0.80 |
LLM | # Prog | # UT | VQAv2 (X-Dataset) | OK-VQA (X-Dataset) | Winoground (X-Task) | SugarCREPE (X-Task)
---|---|---|---|---|---|---
Base Setup | ||||||
CodeLlama-7B | 1 | 0 | 25.67±2.20 | 16.09±2.02 | 30.54±0.99 | 35.12±1.46 |
CodeGemma-7B | 1 | 0 | 36.40±1.44 | 27.58±2.48 | 42.56±1.52 | 41.33±1.72 |
Supervised Correctness Reward | ||||||
CodeLlama-7B | 1 | 0 | 34.33±7.82 | 24.12±5.98 | 41.02±3.05 | 37.14±6.48 |
CodeGemma-7B | 1 | 0 | 42.47±6.03 | 28.12±6.20 | 47.98±4.98 | 39.94±11.58 |
Unsupervised ViUniT Reward (Ours) | ||||||
CodeLlama-7B | 1 | 0 | 35.87±2.31 | 25.64±0.91 | 43.63±2.89 | 44.35±3.18 |
CodeGemma-7B | 1 | 0 | 44.00±4.20 | 36.85±3.48 | 51.78±0.41 | 49.23±2.54 |
Appendix G End-to-End Fallback Methods
G.1 Implementation Details
G.1.1 VQA
For VQA, we fall back to asking the query directly to Salesforce/blip2-flan-t5-xxl [28], loaded in 8-bit with BitsAndBytes, with a maximum batch size of 4 and generation hyperparameters length_penalty=-1, num_beams=5, max_length=10, min_length=1, do_sample=False, top_p=0.9, repetition_penalty=1.0, and temperature=1.
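A minimal sketch of this fallback, assuming the standard transformers BLIP-2 interface; the "Question: ... Answer:" prompt wrapping is an assumption.

```python
# Sketch of the VQA fallback: BLIP-2 (flan-t5-xxl) in 8-bit with the generation
# hyperparameters listed above. Exact prompt wrapping is assumed.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)


def vqa_fallback(image: Image.Image, question: str) -> str:
    inputs = processor(images=image, text=f"Question: {question} Answer:",
                       return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            length_penalty=-1,
            num_beams=5,
            max_length=10,
            min_length=1,
            do_sample=False,
            top_p=0.9,
            repetition_penalty=1.0,
            temperature=1.0,
        )
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```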
G.1.2 Image-Text-Matching
For image-text matching, we fall back to openai/clip-vit-large-patch14-336 [39], using a similarity threshold of 0.8 for a positive match and predicting a negative match otherwise.
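A sketch of this fallback using the transformers CLIP interface. Cosine similarity of the projected embeddings is assumed as the similarity score here; the exact normalization applied before thresholding may differ.

```python
# Sketch of the image-text-matching fallback: CLIP ViT-L/14-336 with a 0.8
# similarity threshold. Cosine similarity of normalized embeddings is assumed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")


def itm_fallback(image: Image.Image, text: str, threshold: float = 0.8) -> bool:
    inputs = clip_processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()
    return similarity >= threshold  # True = positive match, False = negative
```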
G.2 Results with Fallback Method on Exception
In this work, we report results without employing a fallback method on exceptions, treating such cases as failures in order to better assess the quality of the programs generated by different methods. However, it is common in the literature to report accuracy with a fallback applied on exceptions. Table 12 presents the best program selection results using this fallback approach on error.
LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg.
---|---|---|---|---|---|---
Base Setup | ||||||
gpt-4o-mini† | 1 | 0 | 43.76±1.72 | 51.94±0.56 | 49.46±1.25 | 48.39±1.17 |
CodeLlama-7B† | 1 | 0 | 44.75±2.01 | 51.65±1.09 | 48.57±0.82 | 48.32±1.31 |
CodeGemma-7B† | 1 | 0 | 44.82±2.30 | 47.23±2.26 | 50.18±0.71 | 47.41±1.76 |
Most Common Answer Setup | ||||||
CodeLlama-7B† | 5 | 0 | 49.07±2.79 | 51.29±0.87 | 46.79±1.29 | 49.05±1.65 |
CodeGemma-7B† | 5 | 0 | 46.61±1.24 | 49.10±1.32 | 49.17±1.52 | 48.29±1.36 |
ViUniT Setup (Ours) | ||||||
CodeLlama-7B† | 5 | 5 | 49.27±1.33 | 49.73±0.73 | 47.02±1.19 | 48.67±1.08 |
CodeGemma-7B† | 5 | 5 | 48.14±1.02 | 51.92±0.90 | 51.85±2.16 | 50.63±1.36 |
Appendix H Human Evaluation
This section presents details of the human evaluations of unit-test quality and program correctness. We used Google Forms to conduct the evaluations.
H.1 Unit Test Evaluation
To assess the quality of unit tests, we randomly sample 20 examples from each of the three datasets, each corresponding to 5 unit tests, resulting in a total of 300 unit tests for evaluation. The unit tests were judged by three independent annotators, who were asked “Is the answer [answer] correct given the image?”, where [answer] was populated with the unit test’s expected answer, with binary yes/no responses. Table 13 breaks down the results, showing that on average 75% of unit tests are correct. The annotators then optionally annotated the reason for failure by selecting from “Missing Object”, “Spatial Error”, “Incomplete Object”, “Color Mismatch”, or “Other”. Figure 17 shows the breakdown by error type, highlighting “Missing Object” as the most common source of error.
 | GQA | WinoGround | SugarCREPE | Avg.
---|---|---|---|---
Acc. (%) | 68.00 | 75.00 | 82.00 | 75.00
 | 0.39 | 0.70 | 0.67 | 0.58

H.2 Program Correctness Evaluation
To assess the improvements in program quality from applying ViUniT, we conduct a human evaluation rating GQA programs generated by the Base Setup and the programs selected by ViUniT from 5 candidate programs with 5 unit tests. Two annotators with 3+ years of Python experience graded programs using the following scheme: “Correct: The code accurately and fully answers the query.” (0), “Partially Correct: The code answers the query but has some issues.” (1), “Incorrect: The code does not answer the query correctly.” (2), and “Irrelevant: The code is unrelated to the query.” (3). In addition, they were optionally asked to select the source of error from “Missing Condition”, “Incorrect Logic”, “Irrelevant to the query”, “Wrong Conditions”, “Missing Checks (e.g. could get list index out of range)”, “Performance Issues”, or “Other”. Table 14 shows the breakdown of program correctness improvements using ViUniT, and Figure 18 shows the error types identified for each method. ViUniT has “Missing Checks” as its most common error type, which mostly involves not checking array length before accessing indices and typically still yields correct solutions with reasonable programs, whereas the main culprit for program incorrectness in the Base Setup is “Incorrect Logic”.
 | Base Setup | ViUniT Setup (Ours)
---|---|---
Fully Correct (= 0) | 77% | 86%
Partially Correct (≤ 1) | 86% | 95%
Incorrect (≥ 2) | 14% | 5%
Irrelevant (= 3) | 4% | 0%
 | 0.24 | 0.30
 | 0.59 | 0.40

Appendix I Limitations and Social Ethics Impact
I.1 Limitations
While ViUniT provides significant advancements in the logical correctness and robustness of visual programs, our framework has several limitations that present opportunities for future enhancement. First, although ViUniT improves program selection and execution by leveraging unit tests, it does not fully eliminate the issue of programs being correct for the wrong reasons, as shown by the human evaluation in Table 14. Our approach does not provide a formal guarantee of logical correctness, as it relies on automatically generated tests to evaluate candidate programs. Addressing this challenge opens avenues for integrating formal verification methods and more sophisticated testing strategies to further enhance program correctness. Second, while we optimize for maximizing input and output coverage during unit test generation, the generated tests may not fully capture the space of edge cases or subtle logical errors in complex programs. This limitation highlights the potential for future work to develop more comprehensive coverage metrics and testing methodologies, possibly incorporating code-line execution coverage or other verifiable metrics. Third, the improved accuracy and robustness achieved by ViUniT, as seen in Table 1, come with an increase in computational effort. Generating candidate programs, sampling unit tests, and executing them on generated images introduce additional overhead. This trade-off between accuracy and efficiency presents an exciting challenge for future research to optimize the framework for real-time or resource-constrained applications, possibly through algorithmic improvements or efficient execution strategies. Additionally, enhancing the explainability of program failures remains an area for further development. Providing clear and interpretable feedback when a program is rejected or not selected due to poor performance on unit tests can improve user trust and facilitate debugging; future work could focus on combining unit test outputs to offer detailed explanations of program failures. Finally, while ViUniT has demonstrated effectiveness on VQA and ITM tasks, exploring its applicability to other domains or tasks involving different modalities or reasoning paradigms presents an opportunity to extend its impact. Adapting the framework to diverse domains can unlock new possibilities and broaden its utility. Despite these limitations, the advancements introduced by ViUniT lay a strong foundation for future innovations in visual programming. By addressing these challenges, we can further enhance the robustness, efficiency, and applicability of the framework.
I.2 Social Ethics Impact
ViUniT enhances the robustness and correctness of visual programming, with applications in critical domains like autonomous driving, healthcare, and education. By reducing instances where programs are correct for the wrong reasons, it helps build more trustworthy AI systems. However, ethical considerations are crucial for its responsible deployment. First, ViUniT relies on pre-trained models, which may propagate biases (e.g., gender, racial, or cultural). Future work should focus on integrating bias detection and correction into unit test generation to promote fairness. Second, computational demands may limit access for resource-constrained organizations. Advancing efficiency and optimization can broaden accessibility and foster inclusivity. Third, increased computational needs may raise energy consumption. Optimizing for energy efficiency and using renewable energy can reduce the environmental impact, while improved AI reliability could deliver long-term sustainability benefits. Finally, in sensitive domains such as healthcare or legal decision-making, while ViUniT has the potential to enhance the correctness of visual programs, it is crucial to carefully communicate the framework’s limitations and ensure rigorous validation and transparency. By proactively addressing ethical challenges and focusing on responsible development, we can maximize the positive societal impact of ViUniT, paving the way for more reliable, fair, and trustworthy AI systems.
Appendix J Qualitative Examples

