
ViUniT: Visual Unit Tests for More Robust Visual Programming

Artemis Panagopoulou†,   Honglu Zhou   Silvio Savarese   Caiming Xiong
Chris Callison-Burch   Mark Yatskar   Juan Carlos Niebles
Salesforce AI Research   University of Pennsylvania
https://artemisp.github.io/viunit/
†Work done during internship at Salesforce.
Abstract

Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers, and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.

1 Introduction

Figure 1: Framework Overview. Given a query $q$ about an image, the unit test generator $\psi$ generates a set $\mathcal{T}_{\text{cand}} = \psi(q, p)$ of $M$ candidate pairs $t_i = (c_i, y_i)$, each consisting of an image caption $c_i$ and an expected answer $y_i$ (Section 3.2.1). The coverage sampler $\sigma$ then subsamples $K$ pairs from $\mathcal{T}_{\text{cand}}$, forming the subset $\mathcal{T}_K$ (Section 3.2.2). These captions are passed to an image generator $M$ to create the corresponding images $v_i = M(c_i)$ for each unit test (Section 3.2.3). Each candidate program is subsequently executed and assigned a score $S(p)$ by the scorer $H$ based on its performance on the unit tests (Section 3.3). Finally, the highest-scoring program is selected.

Visual Programming [14, 49], which involves generating executable programs that leverage state-of-the-art specialist systems (e.g., object detection, captioning), has emerged as an effective method for tackling compositional reasoning tasks [48, 15]. Correct visual programs must often be inferred without training programs, because programs are expensive to annotate. Recently, some methods have improved the performance of visual program synthesis by leveraging programs that yield correct results on training data [23, 29]. While these approaches have shown improvements, a critical limitation persists: visual programs can be right for the wrong reasons. For example, a human evaluation of 100 visual programs generated by CodeLlama-7B (a leading open-source large language model (LLM) [41]) that produced correct responses to questions in GQA [20] showed that 33% of them were not actually correct, and 70% of the incorrect programs (23% of the total) would require significant rewriting to be correct.

To mitigate this prevailing issue, we propose Visual Unit Testing (ViUniT), a framework for automatically generating unit tests for visual programs. While automatic unit test generation has gained momentum in text-based tasks [5, 46, 2, 12, 50], its application to visual program synthesis has been limited. Recent efforts toward visual unit tests have focused primarily on checking program return value types (e.g., whether the output falls outside a range of options, like yes or no) [25]. However, this approach does not assess the program's execution or logical correctness, limiting the types of errors it can address. In this work, we bridge this gap by addressing challenges that have hindered unit test use in visual question answering (VQA) and image-text matching (ITM).

As seen in Figure 1, visual programming converts queries to code that executes on test images to provide a response. For such programs, unit tests take the form of images and expected answers. Unit tests are difficult to construct because they need sufficient coverage to diagnose errors. To solve this problem, we leverage language models to generate candidate sets of descriptions of images that could test the code (Section 3.2.1). We formulate an optimization criterion to select descriptions that maximize coverage of possible program inputs and outputs (Section 3.2.2), and convert the selected descriptions to images (Section 3.2.3). Our approach is entirely unsupervised, requiring no accompanying annotations.

Unit tests can be used to identify incorrect programs but integrating this signal to improve model behavior is challenging. In Section 3.4 we explore several mechanisms, summarized in Figure 2, including:

  1. Best program selection: Given a set of program candidates, we select the one that passes the most test cases. This approach achieves a 7.7-point improvement over gpt-4o-mini (Table 1) and reduces right-for-wrong-reason programs by 40% (Section 7).

  2. Re-prompting: We use unit test outputs to guide the generation of improved programs when initial programs perform poorly on the unit test suite. Relative to regeneration without unit tests, programs are over 3% more accurate (Table 3).

  3. Unsupervised Reinforcement Learning (RL) Reward Design: We use unit test scores as feedback to fine-tune an LLM on programs that are more likely to be correct for the right reasons, surpassing supervised correctness-based rewards by an average of 1.3 points across tasks (Table 4).

  4. Answer refusal: Unit test scores are used to assess program confidence, reverting to an end-to-end model if the program is not robust, achieving up to 0.8 F1 in correctly refusing programs that would fail (Figure 9).

To summarize our contributions, we present ViUniT, the first framework to introduce unit tests that verify the logical correctness of visual programs. We conduct a broad exploration of unit test generation configurations (Section 5), showing that maximizing coverage is an important criterion. We introduce four ways to leverage unit tests to improve models (Section 3.4): best program selection, answer refusal, re-prompting, and unsupervised reward design for reinforcement learning. Overall, integrating unit tests improves frozen-LLM accuracy by 11.4% and enables 7B open-source LLMs to outperform proprietary models like gpt-4o-mini by an average of 7.7 points, while improving underlying code correctness. Broader adoption of unit test suites will significantly enhance the robustness and trustworthiness of visual programming approaches.

2 Related Work

Visual Program Synthesis: Recent advancements in LLMs [1, 33, 53, 54, 4, 3, 21, 36] have led to their use as a planning interface for modularizing tools to execute complex reasoning tasks involving multiple modalities [34, 42, 47, 9, 29] and as a reasoning module for visual agents [58, 59, 57, 16]. Specialized coding LLMs [41, 27, 51, 13] have demonstrated significant potential in addressing visual challenges by generating executable code based on contextual demonstrations [49, 14, 15, 10, 55], with performance comparable to or better than vision-language models [28, 37, 31, 6]. Attempts to improve the initial paradigm involve automatically generating a pool of effective programs to retrieve as in-context examples [48] and tuning a model through reinforcement learning by sampling programs that succeed on the training set [23]. Most relevant to this work, Hu et al. [19] distill program reasoning into a VLM as chain-of-thought reasoning by generating multiple programs per query and selecting the best one, either by using the ground truth answer as a proxy for correctness or by having it evaluated by an LLM. However, a critical issue remains: some generated programs achieve correct outcomes without sound reasoning, which we address in this paper.

LLM Unit Test Generation: Unit tests have been used as a reinforcement learning signal to train code-generating LLMs [27, 5, 44, 12, 45, 7]. Existing methods for automatic unit test generation with LLMs [5, 2, 12, 50] focus primarily on text-based tasks, generating entire unit test scripts. However, these approaches often result in issues like compilation errors, low coverage, redundant assertions, and empty tests [46]. Recent work [25] proposes property testing on the outputs of visual programs by leveraging LLMs to generate properties that should be satisfied by the output given the query (e.g., the output should be a color if the query asks for one). Yet this method inherits many limitations of LLM-generated script-based unit testing and, crucially, it fails to assess logical correctness, overlooking cases where program outputs may be right for the wrong reasons. Instead, we propose a method that generates unit tests to verify the execution of visual programs without requiring an LLM to directly generate unit test scripts, avoiding the issues that tend to accompany automatic unit test generation with LLMs. In particular, we use LLMs to generate image descriptions and expected answers without requiring any direct code generation; the image descriptions and expected answers are then transformed into unit tests using a text-to-image diffusion model [40].

Figure 2: Visual Unit Testing Utilization Strategies (Section 3.4).

3 Method

In this section, we formalize the tasks of visual program synthesis and unit test generation (Section 3.1) and introduce our ViUniT framework (Section 3.2). Our method comprises two main components: unsupervised generation of visual unit tests (Section 3.2) and unit test scoring (Section 3.3). We propose four ways to leverage unit tests in Section 3.4: Best Program Selection, Answer Refusal, Re-Prompting, and Unsupervised RL Reward Design.

3.1 Task Definition

Visual Program Synthesis: Given a visual input $v$ and a textual query $q$ about $v$, our goal is to synthesize a program $p$ that correctly answers $q$ about $v$. Each program $p \in \mathcal{P}$ is executed on the visual input $v$ using an execution engine $\phi$, yielding a predicted answer $\hat{y} = \phi(p, v)$. Our objective is to select the program $p^{\ast}$ that is most likely to produce the correct answer $y^{\ast}$ to the query $q$ about $v$, formalized as:

$$p^{\ast} = \arg\max_{p \in \mathcal{P}} \Pr\left(\phi(p, v) \equiv y^{\ast}\right). \qquad (1)$$

Visual Unit Testing: To assess the candidate programs, we employ a unit test generator $\psi$, which generates a set of unit tests $\mathcal{T} = \psi(q)$. Each unit test $t_i \in \mathcal{T}$ consists of a test visual input $v_i$ and the corresponding correct answer $y_i$ to the query $q$ on that input, i.e., $t_i = (v_i, y_i)$. For each candidate program $p \in \mathcal{P}$, we execute it on all test inputs $v_i$ to obtain outputs $\hat{y}_i = \phi(p, v_i)$ for $t_i \in \mathcal{T}$.

3.2 Unsupervised Visual Unit Test Generation

Figure 3: Unit Test Examples generated by ViUniT.

Given a program $p$ to solve a query $q$, our goal is to generate a set of unit tests $\mathcal{T}$ comprising input images and expected answers, as shown in Figure 3. This process involves three main steps: Candidate Unit Test Generation (Section 3.2.1), Unit Test Sampling (Section 3.2.2), and Image Generation (Section 3.2.3).

3.2.1 Candidate Unit Test Generation $\psi$

As illustrated in Figure 1, rather than generating images directly for unit tests, we first create image descriptions with expected answers. This approach reduces computational overhead during the preliminary stage of unit test coverage sampling, after which we generate images only for those tests that are included in the final unit test suite $\mathcal{T}$. In particular, we first generate a superset of $M$ candidate unit tests using the unit test generator $\psi$, which is implemented as an auto-regressive large language model. The unit test generator $\psi$ can take both the query $q$ and the program implementation $p$ as inputs: $\mathcal{T}_{\text{cand}} = \psi(q, p) = \{t_1, t_2, \ldots, t_M\}$. Each candidate unit test $t_i$ consists of an image caption $c_i$ and an expected answer $y_i$. We explore whether including the program implementation $p$ provides useful signal for unit test generation (Section 5), despite conventional engineering practices that advocate for implementation-independent unit tests. This allows us to investigate whether this principle extends to visual unit testing.
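To make this step concrete, the following is a minimal sketch of candidate generation, assuming a vLLM-served Llama-3 instruct model as in Section 4.3; the prompt wording, the "caption | answer" output format, and the parsing are illustrative assumptions rather than the exact prompts used (see Appendix D).

```python
# Minimal sketch of candidate unit test generation (Section 3.2.1).
# The prompt template and the 'caption | answer' output format below are
# illustrative assumptions; the actual prompts are listed in Appendix D.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512, n=3)

def generate_candidate_tests(query, program=None):
    """Return candidate unit tests as (image caption, expected answer) pairs."""
    prompt = (
        "You are a skilled AI assistant specialized in generating test cases "
        "for programs that respond to queries about images.\n"
        f"Query: {query}\n"
    )
    if program is not None:  # optional Query+Implementation variant (Section 5)
        prompt += f"Program:\n{program}\n"
    prompt += "List image descriptions and expected answers as 'caption | answer' lines."

    candidates = []
    for output in llm.generate([prompt], params)[0].outputs:
        for line in output.text.splitlines():
            if "|" in line:
                caption, answer = (s.strip() for s in line.split("|", 1))
                if caption and answer and len(answer.split()) <= 5:
                    candidates.append((caption, answer))
    return list(dict.fromkeys(candidates))  # deduplicate, preserving order
```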

3.2.2 Unit Test Coverage Sampling $\sigma$

Unit tests verify the behavior of code and should exhibit high isolation and coverage [24]. In the context of visual programs, isolation is trivial since each program is a self-contained function. However, achieving high coverage (ensuring that the tests collectively exercise as much of the codebase as possible) is non-trivial due to the computational overhead of executing all candidate tests. To address this, we define coverage metrics tailored to visual programming unit tests, focusing on maximizing the diversity of both expected answers and visual inputs. The coverage sampler $\sigma$ subsamples $K$ pairs from $\mathcal{T}_{\text{cand}}$, forming the subset $\mathcal{T}_K$.
Coverage by Answer: We aim to include tests that cover all possible expected answers present in the candidate set. Let $Y = \{y_i \mid t_i \in \mathcal{T}_{\text{cand}}\}$ be the set of all expected answers in $\mathcal{T}_{\text{cand}}$. We define the answer diversity criterion as ensuring that for every possible answer $y \in Y$, there is at least one test $t_i \in \mathcal{T}_K$ such that $y_i = y$:

$$\forall y \in Y, \quad \exists t_i \in \mathcal{T}_K \text{ such that } y_i \equiv y. \qquad (2)$$

Coverage by Input: To maximize the diversity of visual inputs without generating all possible images, we operate on the image captions. We define an encoding function $E$ that maps a caption $c$ to a feature vector. We aim to maximize the input diversity score $\sigma_V(\mathcal{T}_K)$, defined as the maximum pairwise distance between the encoded captions:

$$\sigma_V(\mathcal{T}_K) = \max_{t_i, t_j \in \mathcal{T}_K,\, i \neq j} \left\| E(c_i) - E(c_j) \right\| \qquad (3)$$

This encourages the selection of tests with diverse descriptions, which in turn is likely to yield diverse images.
Coverage by Answer then Input: We begin by selecting one test for each possible answer to satisfy the answer diversity criterion (Equation 2). Then, we iteratively select additional tests that maximize $\sigma_V(\mathcal{T}_K)$ using the following criterion until $K$ tests are selected, forming the subset $\mathcal{T}_K$:

$$t_{\text{new}} = \arg\max_{t \in \mathcal{T}_{\text{cand}} \setminus \mathcal{T}_K} \max_{t' \in \mathcal{T}_K} \left\| E(c_t) - E(c_{t'}) \right\|. \qquad (4)$$

3.2.3 Image Generation $M$

For each selected unit test $t_i = (c_i, y_i) \in \mathcal{T}_K$, we generate the corresponding image $v_i$ using a text-to-image model $M$, yielding the final unit test suite $\mathcal{T} = \{(M(c_i), y_i) \mid t_i \in \mathcal{T}_K\}$. We employ three state-of-the-art diffusion models: SDv1.4 [40], SDXL3 [38], and LM Guided Diffusion [30], which utilizes automatically generated templates with phrases and bounding boxes for spatial conditioning. To provide these additional signals, we prompt an LLM with in-context examples and the caption $c_i$ to generate pairs of phrases and bounding boxes $(ph_i, bb_i)$ to feed into the text-to-image model: $v_i = M(c_i, (ph_i, bb_i))$.
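A minimal sketch of this step with SD v1.4 through diffusers, using the generation settings listed in Appendix D.4; the function name and the suite layout are illustrative, and the LM-guided variant with phrases and bounding boxes is omitted here.

```python
# Minimal sketch of Section 3.2.3 with SD v1.4 via diffusers: each sampled
# (caption, answer) pair is turned into a (generated image, answer) unit test.
# Generation settings follow Appendix D.4 (guidance_scale=16.0, 50 steps).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def build_unit_test_suite(sampled_tests):
    """sampled_tests: list of (caption, expected_answer) pairs, i.e. T_K."""
    suite = []
    for caption, expected_answer in sampled_tests:
        image = pipe(caption, guidance_scale=16.0,
                     num_inference_steps=50).images[0]   # v_i = M(c_i)
        suite.append((image, expected_answer))           # t_i = (v_i, y_i)
    return suite
```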

3.3 Program Selection Based on Unit Test Scores

We select the program $p^{\ast}$ that succeeds on the most unit tests via Equation 6, where the overall score $S(p)$ is computed by an aggregator $H$ over individual scores $s_{t_i} = h(\hat{y}_i, y_i)$.

Individual Unit Test Scorer $h$: For each program $p$ and test $t_i = (v_i, y_i) \in \mathcal{T}_K$, we execute $p$ on $v_i$ to obtain the predicted answer $\hat{y}_i = \phi(p, v_i)$. We define a scoring function $h$ that assigns a score $s_{t_i}$ based on the program's output:

$$s_{t_i} = h(\hat{y}_i, y_i) = \begin{cases} -\epsilon_r, & \text{if runtime error}, \\ -\epsilon_c, & \text{if compilation error}, \\ \mathbb{I}\{\hat{y}_i \equiv y_i\}, & \text{otherwise} \end{cases} \qquad (5)$$

where $\epsilon_r$ and $\epsilon_c$ are runtime and compilation error penalties and $\mathbb{I}$ is the indicator function.

Score Aggregator $H$: The individual scores $s_{t_i}$ are aggregated to compute an overall score $S(p) = H(\{s_{t_i} \mid t_i \in \mathcal{T}\})$. Here, $H$ is the averaging function. The program $p^{\ast}$ with the highest score is selected as the best candidate, approximating Equation 1 by:

$$p^{\ast} = \arg\max_{p \in \mathcal{P}} S(p). \qquad (6)$$
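A minimal sketch of the scorer $h$, the averaging aggregator $H$, and the selection in Equation 6; `execute` is an assumed helper wrapping the execution engine $\phi$, and the error penalties follow Section 4.3 ($\epsilon_r = \epsilon_c = 0.1$).

```python
# Sketch of unit test scoring (Eq. 5) and best program selection (Eq. 6).
# `execute(program, image)` is an assumed helper wrapping the execution
# engine phi; it returns the program's answer or raises on failure.
EPS_RUNTIME = 0.1   # epsilon_r
EPS_COMPILE = 0.1   # epsilon_c

def score_unit_test(program, image, expected):
    """Individual scorer h: penalize errors, otherwise check the answer."""
    try:
        predicted = execute(program, image)        # y_hat_i = phi(p, v_i)
    except SyntaxError:
        return -EPS_COMPILE                        # compilation error
    except Exception:
        return -EPS_RUNTIME                        # runtime error
    return float(str(predicted).strip().lower() == str(expected).strip().lower())

def score_program(program, unit_tests):
    """Aggregator H: average score over the unit test suite."""
    scores = [score_unit_test(program, image, answer) for image, answer in unit_tests]
    return sum(scores) / len(scores)

def select_best_program(programs, unit_tests):
    """Best program selection: argmax_p S(p)."""
    return max(programs, key=lambda p: score_program(p, unit_tests))
```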

3.4 Visual Unit Test Utilization Methods

Figure 2 illustrates how to leverage visual unit tests in four ways, further elaborated below:

Best Program Selection: Given a set of candidate programs $\mathcal{P} = \{p_1, p_2, \ldots, p_N\}$ for a query $q$, our goal is to select the program $p^{\ast}$ that is most likely to produce the correct answer when executed on the visual input $v$. We utilize the unit test scores $S(p)$ computed for each program $p \in \mathcal{P}$ as described in Section 3.3. The best program (the one that succeeds on the most unit tests) is selected by solving the optimization problem in Equation 6.
Answer Refusal: If the maximum unit test score $S(p^{\ast})$ falls below a threshold $\theta$, indicating low confidence in all candidate programs, we refuse to provide a programmatic answer and instead fall back to an end-to-end model (refer to the supplement for details). Formally, the decision rule is: if $S(p^{\ast}) < \theta$, refuse to answer and redirect. Otherwise, we proceed to execute the selected program $p^{\ast}$ on the original visual input $v$ to obtain the final answer $\hat{y} = \phi(p^{\ast}, v)$. The hyperparameter $\theta$ balances a trade-off between attempting to answer with potentially incorrect programs and deferring to a more reliable but less interpretable method.
Re-Prompting: If all generated programs $\mathcal{P}$ fail to meet the threshold $\theta$ (i.e., $\max_{p \in \mathcal{P}} S(p) < \theta$), we employ a re-prompting strategy to generate better candidate programs using feedback from unit tests:

$$\mathcal{P}' = \pi\left(x'(q) + \mathcal{F}\right) \qquad (7)$$

where $x'(q)$ is an adaptation of the original input containing the API, the query $q$, and in-context examples of unit-test-feedback corrections; $\mathcal{F}$ is the feedback derived from the unit test results (comprising the unit test image descriptions, the expected answers, and the answers predicted by the program in the current iteration), summarizing the discrepancies between expected and actual outputs; and $\pi$ is the program generator.

We select the best program $p^{\ast\ast}$ from the new set $\mathcal{P}'$ based on their unit test scores, $p^{\ast\ast} = \arg\max_{p' \in \mathcal{P}'} S(p')$. If $S(p^{\ast\ast}) \geq \theta$, we execute $p^{\ast\ast}$ on the original visual input $v$. Otherwise, we may repeat the re-prompting process until a predefined number of iterations is reached.
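A sketch of this loop under stated assumptions: `generate_programs` stands in for the program generator $\pi$, `score_program` implements $S(p)$ from Section 3.3, and `format_feedback` serializes the unit test descriptions, expected answers, and current predictions into the feedback $\mathcal{F}$; all three are hypothetical helper names.

```python
# Sketch of unit-test-guided re-prompting (Eq. 7). generate_programs,
# score_program, and format_feedback are illustrative helper names for the
# program generator pi, the score S(p), and the feedback F, respectively.
def reprompt_if_needed(query, unit_tests, theta=0.7, max_iters=1):
    programs = generate_programs(query)                      # initial candidates
    best = max(programs, key=lambda p: score_program(p, unit_tests))
    for _ in range(max_iters):
        if score_program(best, unit_tests) >= theta:
            break                                            # confident enough
        feedback = format_feedback(best, unit_tests)         # the feedback F
        programs = generate_programs(query, feedback=feedback)
        best = max(programs, key=lambda p: score_program(p, unit_tests))
    return best
```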
Unsupervised Reinforcement Learning Reward Design: We propose to design RL rewards based on visual unit tests, aiming not only to provide extra supervision but also to curtail policy deterioration due to logically incorrect programs [23]. The goal is to optimize a program-generation policy $\pi_w$, implemented as an autoregressive language model parameterized by $w$, by minimizing the reward-weighted loss over the dataset $D$, where each example consists of an image $v$, a user query $q$, a program $p$ generated by the previous iteration's policy $\pi_{w^{\text{itr}-1}}$, and a ground truth answer $y$:

$$J(w) = \mathbb{E}_{(v, q, p, y) \sim D}\left[R(v, p, y)\, L_{\text{NLL}}(p, q; w)\right], \qquad (8)$$

where $L_{\text{NLL}}(p, q; w) = -\sum_{l=1}^{L} \log \pi_w(p_l \mid p_{1:l-1}, x(q))$ is the negative log-likelihood loss for next-token prediction and $L$ is the sequence length.

Khan et al. [23] introduce a correctness reward based on performance on the training set:

$$R_{\text{Correct}}(v, p, y) = \begin{cases} 1, & \text{if } \phi(p, v) \equiv y, \\ 0, & \text{otherwise}. \end{cases} \qquad (9)$$

However, this approach can lead to sparse rewards and may falsely reward programs that are right for the wrong reasons. Khan et al. [23] address this issue through human corrections to stabilize training. Instead, we reformulate the reward using feedback from the visual unit tests:

$$R_{\text{ViUniT}}(v, p) = \begin{cases} 1, & \text{if } S(p) \geq \theta, \\ S(p), & \text{otherwise}, \end{cases} \qquad (10)$$

where $\theta$ is a passing threshold. We terminate policy iteration on declining reward. Following earlier work [22], we assume that an optimal policy will keep increasing an optimal reward function $R^{\ast}$. Thus, when our proxy reward $R$ declines (i.e., regret increases), there are theoretical guarantees that we are not far from the optimal policy that can be learned under $R$.
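For illustration, a sketch of the reward (Eq. 10) and of the reward-weighted negative log-likelihood term (Eq. 8) for a single example, assuming a HuggingFace causal LM; the prompt-token masking is approximate and `score_program` is the scorer sketched in Section 3.3.

```python
# Sketch of the unsupervised ViUniT reward (Eq. 10) and the reward-weighted
# NLL term of Eq. 8 for one (query, program) pair. The masking of prompt
# tokens is approximate and for illustration only.
import torch

def viunit_reward(program, unit_tests, theta=0.8):
    s = score_program(program, unit_tests)       # S(p) from Section 3.3
    return 1.0 if s >= theta else s

def reward_weighted_nll(model, tokenizer, prompt, program, reward):
    ids_prompt = tokenizer(prompt, return_tensors="pt").input_ids
    ids_full = tokenizer(prompt + program, return_tensors="pt").input_ids
    labels = ids_full.clone()
    labels[:, : ids_prompt.shape[1]] = -100      # only score program tokens
    nll = model(input_ids=ids_full, labels=labels).loss   # mean NLL, i.e. L_NLL
    return reward * nll
```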

4 Experimental Setup

Below we describe the experimental setup: datasets (Section 4.1), baselines (Section 4.2), and implementation details (Section 4.3).

4.1 Data

We utilize three compositional reasoning datasets: GQA [20] for Visual Question Answering (VQA), and SugarCREPE [17] and Winoground [52] for Image-Text Matching (ITM), assessing model performance via accuracy metrics. For GQA, we calculate accuracy using an implementation by Surís et al. [49] (https://github.com/cvlab-columbia/viper/blob/main/datasets/gqa.py), which standardizes and compares generated answers for exact matches. Our experimental setup incorporates training and testing splits sampled similarly to Khan et al. [23], specifically testing on 502 examples from the GQA balanced-val split and training on 1022 examples from the balanced-train split, with 10 samples per question group. In SugarCREPE, we utilize 788 examples for training by subsampling approximately 10% of the dataset balanced across question types, excluding our validation split. The validation subset consists of 560 examples and includes both positive and negative image-text pairings from 40 samples from each of the 7 question types. The full Winoground dataset is used, encompassing all possible positive and negative pairings for a total of 1600 test examples, with the SugarCREPE dataset employed for training purposes. Refer to the supplement for further dataset details.

(a) Unit Tests Generated by Different Sampling Methods
(b) Unit Tests Generated by Different Diffusion Methods
Figure 4: Comparison of Unit Tests Generated by Different Methods

4.2 Baselines

We evaluate against the following baselines: 
Base Setup: Following the prototypical use of visual programs [49, 14], we prompt the LLM to generate a single program per query, which is executed to retrieve a response.
Most Common Answer: To leverage multiple programs, we compare performance with selecting the most common answer across executed programs if one exists.
Error Re-prompting: To evaluate the effectiveness of unit-test incorporation in program correction via unit-test re-prompting, we benchmark performance against a method that leverages error traces as the feedback $\mathcal{F}$ in Equation 7. Further details are provided in the supplement.
Correctness Reward: We benchmark the unsupervised unit-test RL reward formulation against the supervised correctness reward described in Equation 9.

4.3 Implementation Details

We provide a summary of key implementation details, with additional information in the supplement. Experiments were conducted on two A100 40GB GPUs, though a single GPU suffices for smaller API models. Results report the mean and standard deviation across 3 runs.
Program Generation Models: Three program generator models are employed: codellama/CodeLlama-7b-Python-hf [41] and google/codegemma-7b-it [51], hosted on HuggingFace and served by vLLM [26], as well as gpt-4o-mini [1], served by OpenAI. We use HuggingFace's SFT-Trainer to train the RL policy using LoRA [18] with $\theta = 0.8$ in Equation 10. Models are prompted with an API adapted from ViperGPT [49] and 4 in-context examples.
API Models: Object detection is performed using IDEA-Research/grounding-dino-base [32]. For image-text matching, we use openai/clip-vit-large-patch14-336 [39], and for VQA answering, we employ Salesforce/blip2-flan-t5-xxl [28]. All models are accessed through HuggingFace.
Unit Test Generation Models: We use meta-llama/Meta-Llama-3-8B-Instruct [8] to generate image descriptions and expected answers for unit test candidates. The unit test sampler is implemented with sentence-transformers, using the all-MiniLM-L6-v2 [56] model to embed image descriptions. For image generation, we use the diffusers library, specifically CompVis/stable-diffusion-v1-4 for SDv1.4, longlian/lmd_plus for LM Guided Diffusion, and stabilityai/stable-diffusion-xl-base-1.0 for SDXL3. 
Program Scoring and Execution: Program executions are capped at 120 seconds. Unit test scoring error penalties are set to $\epsilon_r = \epsilon_c = 0.1$ (Equation 5). Unless otherwise specified, no fallback to an end-to-end model was employed on exception.

5 Strategies for Visual Unit Test Generation

We explore different unit test generation configurations, applied to best program selection, using a smaller dataset of three questions from each group in GQA and each tag in WinoGround, yielding 303 and 504 samples, respectively.

Figure 5: Accuracy across varying unit test and program counts.

Number of unit tests $K$. Figure 5 illustrates that increasing both the number of unit tests and the number of candidate programs improves accuracy on both datasets. Accuracy rises substantially with the addition of unit tests, particularly from 1 to 5 tests, after which gains diminish. Higher numbers of programs (e.g., 4 or 5) consistently yield better accuracy compared to fewer programs, underscoring the benefit of exploring multiple candidate solutions.

Figure 6: Program in context in unit test generation for GQA.

Unit Test Generator $\psi$. Figure 6 demonstrates that in low unit test settings, incorporating program information into unit test generation yields comparable results to query-only approaches. However, as the number of unit tests and programs increases, disregarding implementation details proves significantly more effective. This aligns with software engineering best practices, where unit tests are designed to remain independent of specific implementations.

Figure 7: Sampling method comparison at 5 programs.

Unit Test Sampler $\sigma$. Figure 7 demonstrates the impact of different unit test sampling methods on model accuracy. In GQA, “Coverage by Answer then Input” shows increasing performance as the number of unit tests grows, allowing the space of possible answers to saturate. Figure 4(a) highlights limitations of the other methods: “Coverage by Input” may suffer from reduced answer diversity, and “Coverage by Answer” can involve repetitive inputs. In WinoGround there is negligible difference across methods, because its restriction to two answers prevents significant sampling diversity. Nevertheless, an analysis of performance by question type in the supplement shows that this sampling method yields higher results for attribute-related queries in both datasets.

Figure 8: Image generator comparison at 5 programs.

Image Generator $M$. Figure 8 illustrates the impact of different diffusion models. In GQA, at lower unit test counts LM Guided Diffusion yields some accuracy improvements, while for WinoGround, LM Guided Diffusion only helps in lower program settings, with quick convergence as the number of programs increases. The benefit of LM Guided Diffusion is primarily driven by improved tests when spatial positioning is critical, as shown by the result breakdowns in the supplement and illustrated in Figure 4(b).

Figure 9: Refusal evaluation at different passing thresholds.

Scoring function $h$. The supplement presents results with varying error penalties, illustrating that in few-unit-test settings, imposing error penalties increases the likelihood of selecting a successful program.

| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Base Setup | | | | | | |
| gpt-4o-mini | 1 | 0 | 42.03±1.21 | 44.98±0.75 | 38.75±0.47 | 41.92±0.81 |
| CodeLlama-7B | 1 | 0 | 35.99±2.94 | 38.83±0.45 | 30.54±0.99 | 35.12±1.46 |
| CodeGemma-7B | 1 | 0 | 41.83±2.26 | 39.60±1.38 | 42.56±1.52 | 41.33±1.72 |
| Most Common Answer Setup | | | | | | |
| CodeLlama-7B | 5 | 0 | 42.50±1.50 | 45.85±0.77 | 41.67±1.79 | 43.34±1.35 |
| CodeGemma-7B | 5 | 0 | 43.89±0.98 | 46.04±1.48 | 46.67±1.69 | 45.53±1.38 |
| ViUniT Setup (Ours) | | | | | | |
| CodeLlama-7B | 5 | 5 | 49.27±1.33 | 49.73±0.73 | 47.02±1.19 | 48.67±1.08 |
| CodeGemma-7B | 5 | 5 | 48.01±1.05 | 51.92±0.90 | 51.85±2.16 | 50.59±1.37 |

Table 1: Accuracy on Best Program Selection. Bold is best.
| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Reverting on Error | | | | | | |
| CodeLlama-7B | 1 | 0 | 44.89±2.04 | 51.67±1.16 | 49.29±0.99 | 48.61±1.40 |
| CodeGemma-7B | 1 | 0 | 44.89±2.19 | 47.25±2.17 | 49.58±0.88 | 47.24±1.74 |
| Reverting on ViUniT Threshold $\theta = 0.7$ (Ours) | | | | | | |
| CodeLlama-7B | 1 | 5 | 54.18±0.40 | 50.67±1.28 | 49.05±0.82 | 51.30±0.84 |
| CodeGemma-7B | 1 | 5 | 54.58±1.24 | 50.73±0.94 | 50.12±1.62 | 51.81±1.27 |

Table 2: Answer Refusal: Reverting to an end-to-end model on error or on unit test passing failure ($\theta = 0.7$). Bold is best.
| LLM | Iter. | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Error Re-prompting | | | | | | | |
| CodeLlama-7B | 1 | 1 | 0 | 37.92±2.68 | 42.46±0.57 | 33.21±0.64 | 37.86±1.30 |
| CodeGemma-7B | 1 | 1 | 0 | 42.63±2.42 | 42.42±1.91 | 44.52±1.05 | 42.63±2.42 |
| ViUniT Re-prompting $\theta = 0.7$ (Ours) | | | | | | | |
| CodeLlama-7B | 1 | 1 | 5 | 46.68±2.52 | 51.85±0.40 | 47.68±2.17 | 48.74±1.69 |
| CodeGemma-7B | 1 | 1 | 5 | 45.75±0.30 | 48.19±2.28 | 48.21±1.12 | 47.38±1.23 |

Table 3: Accuracy of different re-prompting methods. Bold is best.
| LLM | # Prog | # UT | GQA (VQA) | Winoground (ITM) | SugarCREPE (ITM) | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Supervised Correctness Reward | | | | | | |
| CodeLlama-7B | 1 | 0 | 39.18±4.88 | 48.65±0.87 | 39.58±2.75 | 42.47±2.83 |
| CodeGemma-7B | 1 | 0 | 43.03±5.08 | 45.98±2.64 | 46.31±2.26 | 45.11±3.33 |
| Unsupervised ViUniT Reward (Ours) | | | | | | |
| CodeLlama-7B | 1 | 0 | 40.57±2.10 | 46.52±0.81 | 41.85±1.44 | 42.98±1.45 |
| CodeGemma-7B | 1 | 0 | 45.68±2.45 | 49.29±0.43 | 46.55±0.69 | 47.17±1.19 |

Table 4: Comparison of RL with supervised correctness rewards versus unsupervised unit-test-based rewards. Bold is best.

6 Strategies of Visual Unit Test Utilization

Best Program Selection: Table 1 underscores the efficacy of ViUniT selection in identifying the best program. Our approach demonstrates a notable average improvement of 11.4 accuracy points over the base setup and a substantial 7.7-point average gain over the gpt-4o-mini configuration. Furthermore, it surpasses most-common-answer selection by an average margin of 5.2 points.

Answer Refusal: Figure 9 illustrates the impact of varying the threshold $\theta$ on the F1 score of refusing programs with incorrect answers (left) and on the false pass failure rate (right), measured relative to the total number of programs. The minimal false pass failure rate at higher thresholds supports the use of unit test scores as a proxy for correctness during unsupervised model fine-tuning. Table 2 shows an improvement of 3.6 points from reverting to a fixed model when $S(p^{\ast}) < \theta = 0.7$ compared to reverting only on error. For CodeLlama-7B, performance on image-text matching is similar between the two methods, as some programs yield correct answers despite failing unit tests. Although such programs impact final performance, a human inspection of 40 samples revealed that 65% were unreliable from the start.

Re-prompting: Table 3 demonstrates that re-prompting with ViUniT achieves an average improvement of 7.5 points over error-based re-prompting, with a notable 10.9-point increase for CodeLlama-7B, which performs lower in the base setting. The unit tests offer additional opportunities for refining the method's initial response, as they go beyond error detection to assess program confidence, while also providing a measure of comparison between programs.

RL Reward Design: The pattern of improvements is particularly interesting in the RL setting, where we find that ViUniT rewards outperform correctness rewards by an average of 1.3 points in accuracy despite not relying on the training labels. Additionally, we observe a notable reduction in the percentage of code leading to exceptions; errors decrease from 14.47% to 11.76% for CodeLlama and even more sharply from 11.73% to 4.68% for CodeGemma. These results indicate that heavily rewarding higher-quality code, as filtered through unit tests, encourages the development of a more robust and error-resistant policy.

7 Human Evaluation

We summarize key findings from two human evaluations that assess unit test quality and improvements in program reliability. Full details are available in the supplement.
Unit Test Evaluation: We randomly sampled 20 examples from each of the three datasets, each with 5 unit tests, resulting in a total of 300 unit tests, each judged by three annotators. Based on the majority annotator response, 75% of unit tests per sample were correct. Annotators could optionally comment on errors, with “Missing Object” noted as the most frequent issue.
Program Evaluation: To measure the effectiveness of unit tests in enhancing program reliability, we evaluated 100 VQA programs that correctly answered the queries both from the base and the unit-test best program selection setups. Two annotators with 3+ years of Python experience graded programs from 0 (Fully Correct) to 3 (Irrelevant). Under the unit test setup, 86% of programs were fully correct, compared to 77% in the base setup. Additionally, only 5% of programs were marked completely incorrect—with none deemed irrelevant—compared to 14% and 4%, respectively, in the base setup. Notably, the most common error type shifted from “Incorrect Logic” in the base setup to “Missing Checks (e.g., list index out of range)” in the unit-test setup.

8 Conclusion and Future Work

We introduce ViUniT, the first framework to automatically generate unit tests for verifying visual program correctness, addressing cases where programs may appear correct for the wrong reasons. Unit tests are leveraged in four ways: best program selection (+11.4 points over the base setup and +7.7 points over gpt-4o-mini), answer refusal, re-prompting, and unsupervised RL reward design (+1.3 points over supervised rewards). Future directions include fine-grained test generation and broader task applications. By reinforcing logical correctness, ViUniT advances robustness and interpretability in visual programs.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alagarsamy et al. [2024] Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. A3test: Assertion-augmented automated test case generation. Information and Software Technology, 176:107565, 2024.
  • Anil et al. [2023] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen et al. [2023] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023.
  • Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Dou et al. [2024] Shihan Dou, Yan Liu, Haoxiang Jia, Enyu Zhou, Limao Xiong, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. StepCoder: Improving code generation with reinforcement learning from compiler feedback. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4571–4585, Bangkok, Thailand, 2024. Association for Computational Linguistics.
  • Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Gao et al. [2024] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. Clova: A closed-loop visual assistant with tool usage and update. Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Ge et al. [2025] Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, and Trevor Darrell. Recursive visual programming. In European Conference on Computer Vision, pages 1–18. Springer, 2025.
  • Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • Guilherme and Vincenzi [2023] Vitor Guilherme and Auri Vincenzi. An initial investigation of chatgpt unit test generation capability. In Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, pages 15–24, 2023.
  • Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
  • Gupta and Kembhavi [2023] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
  • [15] Cheng Han, James Chenhao Liang, Qifan Wang, MAJID RABBANI, Sohail Dianat, Raghuveer Rao, Ying Nian Wu, and Dongfang Liu. Image translation as diffusion visual programmers. In The Twelfth International Conference on Learning Representations.
  • Hong et al. [2024] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
  • Hsieh et al. [2024] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in neural information processing systems, 36, 2024.
  • Hu et al. [2022] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • Hu et al. [2024] Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9590–9601, 2024.
  • Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • [22] Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart’s law in reinforcement learning. In The Twelfth International Conference on Learning Representations.
  • Khan et al. [2024] Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, and Manmohan Chandraker. Self-training large language models for improved visual program synthesis with visual reinforcement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14344–14353, 2024.
  • Khorikov [2020] Vladimir Khorikov. Unit Testing Principles, Practices, and Patterns. Simon and Schuster, 2020.
  • Koo et al. [2024] Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, and Vicente Ordonez. PropTest: Automatic property testing for improved visual programming. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8241–8256, Miami, Florida, USA, 2024. Association for Computational Linguistics.
  • Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  • Le et al. [2022] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
  • Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • Li et al. [2024] Zhuowan Li, Bhavan Jasani, Peng Tang, and Shabnam Ghadar. Synthesize step-by-step: Tools templates and llms as data generators for reasoning-based chart vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13613–13623, 2024.
  • Lian et al. [2024] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. Transactions on Machine Learning Research, 2024. Featured Certification.
  • Liu et al. [2024a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024a.
  • Liu et al. [2024b] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer, 2024b.
  • Longpre et al. [2023] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pages 22631–22648. PMLR, 2023.
  • Lu et al. [2024] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  • Nijkamp et al. [2023] Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, et al. Xgen-7b technical report. arXiv preprint arXiv:2309.03450, 2023.
  • Panagopoulou et al. [2024] Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
  • Podell et al. [2024] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  • Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  • Schick et al. [2024] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.
  • Selvaraju et al. [2020] Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020.
  • Shen et al. [2023] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, et al. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936, 2023.
  • [45] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. Execution-based code generation using deep reinforcement learning. Transactions on Machine Learning Research.
  • Siddiq et al. [2023] Mohammed Latif Siddiq, Joanna Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, FA Rifat, and V Carvalho Lopes. Exploring the effectiveness of large language models in generating unit tests. arXiv preprint arXiv:2305.00418, 2023.
  • Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  • [48] Aleksandar Stanić, Sergi Caelles, and Michael Tschannen. Towards truly zero-shot compositional visual reasoning with llms as programmers. Transactions on Machine Learning Research.
  • Surís et al. [2023] Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.
  • Takerngsaksiri et al. [2024] Wannita Takerngsaksiri, Rujikorn Charakorn, Chakkrit Tantithamthavorn, and Yuan-Fang Li. Tdd without tears: Towards test case generation from requirements through deep reinforcement learning. arXiv preprint arXiv:2401.07576, 2024.
  • Team [2024] CodeGemma Team. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409, 2024.
  • Thrush et al. [2022] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238. IEEE Computer Society, 2022.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Ukai et al. [2024] Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, and Nakamasa Inoue. Adacoder: Adaptive prompt compression for programmatic visual question answering. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9234–9243, 2024.
  • Wang et al. [2020] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
  • Wei et al. [2024] Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative llm-agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15077–15087, 2024.
  • Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI conference on artificial intelligence, pages 3081–3089, 2022.
  • Yang et al. [2023] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.

Appendix A Data

The three compositional reasoning datasets used in this work are GQA [20], SugarCREPE [17], and WinoGround [52]. Table 5 shows examples from each dataset, and Table 6 summarizes the dataset statistics. For GQA evaluation, we sample 5 questions from each of the 102 question groups in the balanced-val split, for a total of 502 examples. For training, we sample 10 questions per group from the balanced-train split, yielding 1022 examples. Note that some groups, such as typeVerifyC, stateChoose, and companyVerify, do not have a sufficient number of questions, so we sample the whole group. For SugarCREPE, we utilize 788 examples for training by subsampling 10% of the dataset balanced across the 7 question types, excluding our validation split. This validation subset consists of 560 examples and includes both positive and negative image-text pairings from 40 samples from each of the 7 question types. The full Winoground dataset is used, encompassing all possible positive and negative pairings for a total of 1600 examples, with the SugarCREPE dataset employed for training.

| Image | Question | Answer |
| --- | --- | --- |
| GQA | | |
| (image) | Are there any guys to the right of the brown horse? | no |
| (image) | Which direction is the animal that looks white and brown looking at? | forward |
| (image) | What type of animal is that fence behind of, an elephant or a giraffe? | giraffe |
| SugarCREPE | | |
| (image) | Is there a white pitcher holding flowers in a window sill? | yes |
| (image) | Are a cat and a dog napping together under a blanket on the couch? | no |
| (image) | Is a dog sitting in front of a laptop on top of a bed? | yes |
| WinoGround | | |
| (image) | Verify image matches text=“two humans and one wheel” | yes |
| (image) | Verify image matches text=“red building with white shutters” | no |
| (image) | Verify image matches text=“the person with the white collared shirt waters the plant while the other holds it” | yes |

Table 5: Dataset Samples
| Dataset | # Samples | # Images | # Questions | # Answers | # Question Types | # Questions/Type |
| --- | --- | --- | --- | --- | --- | --- |
| GQA | 1022/502 | 1014/487 | 937/474 | 176/122 | 105/102 | 10/5 |
| WinoGround | -/1600 | -/800 | -/800 | -/2 | -/70 | -/8 |
| SugarCREPE | 788/560 | 335/260 | 765/557 | 2/2 | 7/7 | 52/80 |

Table 6: Dataset Statistics. Values are shown in {train/test} format. For SugarCREPE and WinoGround, both positive and negative image-text pairings are included. In GQA, question types are divided by the data field group, and in WinoGround by the data field tag. The training data for WinoGround consists of SugarCREPE.

Appendix B Unit Test Sampling Pseudocode

For clarity, Algorithm 1 presents the pseudocode for the unit test coverage sampling method described in Section 3.

Algorithm 1 Unit Test Sampling Algorithm
Require: $T = \{t_1, t_2, \dots, t_n\}$, the set of texts
Require: $A = \{a_1, a_2, \dots, a_m\}$, the set of answers
Require: $f: T \rightarrow A$, a function mapping each text to an answer
Require: $E(t)$, embedding function for text $t$
Require: $k$, number of samples
Require: use_answers, a boolean flag
Ensure: $S$, a subset of $T$ of size $k$
1: function SampleTexts($T$, $A$, $f$, $E$, $k$, use_answers)
2:     Initialize $S \leftarrow \emptyset$
3:     if use_answers $=$ True then
4:         for each $a_i \in A$ do
5:             Select $t$ from $T$ such that $f(t) = a_i$
6:             $S \leftarrow S \cup \{t\}$
7:             $T \leftarrow T \setminus \{t\}$
8:         end for
9:     else
10:        Select a random $t$ from $T$
11:        $S \leftarrow \{t\}$
12:        $T \leftarrow T \setminus \{t\}$
13:    end if
14:    while $|S| < k$ do
15:        $s_{\text{new}} \leftarrow \arg\max_{t \in T} \max_{s \in S} \|E(t) - E(s)\|$
16:        $S \leftarrow S \cup \{s_{\text{new}}\}$
17:        $T \leftarrow T \setminus \{s_{\text{new}}\}$
18:    end while
19:    return $S$
20: end function
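For concreteness, below is a runnable Python version of Algorithm 1, using sentence-transformers (all-MiniLM-L6-v2, as in Section 4.3) for the embedding function $E$; variable names are illustrative, and the early break guards against selecting more than $k$ tests when there are many distinct answers.

```python
# Runnable sketch of Algorithm 1, using sentence-transformers for E(t).
# texts are the candidate unit test captions; answers[i] is the expected
# answer of texts[i] (the mapping f); k is the number of tests to keep.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def sample_texts(texts, answers, k, use_answers=True):
    embeddings = encoder.encode(texts)                        # E(t) for every candidate
    remaining = list(range(len(texts)))
    selected = []

    if use_answers:                                           # one test per distinct answer
        for answer in dict.fromkeys(answers):
            idx = next(i for i in remaining if answers[i] == answer)
            selected.append(idx)
            remaining.remove(idx)
            if len(selected) == k:
                break
    else:                                                     # start from a random test
        idx = random.choice(remaining)
        selected.append(idx)
        remaining.remove(idx)

    while len(selected) < k and remaining:                    # greedy max-distance selection
        distances = [
            max(np.linalg.norm(embeddings[i] - embeddings[j]) for j in selected)
            for i in remaining
        ]
        idx = remaining[int(np.argmax(distances))]
        selected.append(idx)
        remaining.remove(idx)

    return [texts[i] for i in selected]
```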

Appendix C Program Generation and Execution

In this section, we outline the implementation details for program generation and execution.

C.1 Generation Details

For program generation we use in-context examples both for off-the-shelf inference and for finetuned model inference. Generation is conducted using VLLM with the following generation parameters: temperature=1.0, top_p=0.9, top_k=0.0, max_new_tokens=320, and num_beams=1. We set a high temperature to ensure diversity among generated programs. For CodeLLaMA we prefix the prompt with <s>, and for CodeGemma we enclose it in <bos><start_of_turn>[..]<end_of_turn>.
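For concreteness, the sketch below shows an equivalent generation call. The paper serves the models with VLLM; the snippet instead uses the HuggingFace transformers generate API, whose argument names match the hyperparameters listed above, and the model id, prompt pieces, and query are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # assumption: HF id of the code LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

api_and_in_context_prompt = "..."  # ImagePatch API + in-context examples (Listings 1 and 3)
query = "Query: Is the pillow in the top part or in the bottom of the picture?\nProgram:\n"

prompt = "<s>" + api_and_in_context_prompt + query  # CodeLlama-style prefix, per the text above
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,   # high temperature for program diversity
    top_p=0.9,
    top_k=0,
    max_new_tokens=320,
    num_beams=1,
)
program = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)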

C.2 Image Patch API

We present the ImagePatch API in Listing 1, which we adapt from Khan et al. [23], which is in turn adapted from ViperGPT [49]. We implement object detection with IDEA-Research/grounding-dino-base [32] using text_threshold=box_threshold=0.2, and image-text matching with openai/clip-vit-large-patch14-336 [39] using a 0.8 similarity threshold for detection. The underlying visual question answering module is Salesforce/blip2-flan-t5-xxl [28], loaded in 8-bit with BitsAndBytes, with a maximum batch size of 4 and generation hyperparameters length_penalty=-1, num_beams=5, max_length=10, min_length=1, do_sample=False, top_p=0.9, repetition_penalty=1.0, and temperature=1 for QA; for captioning we set length_penalty=1 and max_length=30. All models are served through HuggingFace.
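The snippet below sketches how the underlying VQA module can be loaded and queried with the settings above; it is illustrative, the question template is an assumption, and the detection and ITM modules are instantiated analogously.

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

def simple_query(image: Image.Image, question: str) -> str:
    # "Question: ... Answer:" is a commonly used BLIP-2 VQA template (assumed here).
    inputs = processor(
        images=image, text=f"Question: {question} Answer:", return_tensors="pt"
    ).to(model.device)
    out = model.generate(
        **inputs,
        num_beams=5, length_penalty=-1, max_length=10, min_length=1,
        do_sample=False, top_p=0.9, repetition_penalty=1.0, temperature=1.0,
    )
    return processor.decode(out[0], skip_special_tokens=True).strip()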

C.3 In-Context Examples

We present the in-context examples used for visual question answering and image-text matching in Listings 3 and 2, respectively. Code execution is handled using multiprocessing with a batch size of 30 and a timeout of 120 seconds, after which a TimeOutException is raised.
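The snippet below sketches the timeout mechanism; the worker setup is simplified (in the real executor the ImagePatch API and model handles must be available to the executed program), and only the 120-second limit and the TimeOutException follow the description above.

import multiprocessing as mp

class TimeOutException(Exception):
    pass

def _worker(program_str, image, queue):
    env = {}                                  # in practice, pre-populated with the ImagePatch API
    exec(program_str, env)                    # defines execute_command(image)
    queue.put(env["execute_command"](image))

def run_with_timeout(program_str, image, timeout=120):
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(program_str, image, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeOutException(f"Execution exceeded {timeout} seconds")
    if queue.empty():
        raise RuntimeError("Program raised an exception in the worker process")
    return queue.get()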

Appendix D Unit Test Generation

D.1 Implementation Details

To generate the unit test image descriptions and expected answers we prompt meta-llama/Meta-Llama-3-8B-Instruct, executed via VLLM with the following generation parameters: temperature=0.7, top_p=0.9, top_k=0.0, max_new_tokens=512, and num_beams=1. We return 3 output sequences, from which we extract the unit tests, deduplicate them, and filter out answers longer than five words, since such answers are out of distribution for the task, before passing the remaining tests to the sampling module.
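The post-processing can be sketched as follows, assuming the numbered "Image Caption: ... Answer: ..." format of Listings 6 and 7; the regular expression and helper are illustrative.

import re

TEST_RE = re.compile(r'\d+\.\s*Image Caption:\s*"([^"]+)"\s*Answer:\s*(.+)')

def extract_unit_tests(llm_outputs, max_answer_words=5):
    tests, seen = [], set()
    for text in llm_outputs:                      # one string per returned sequence
        for caption, answer in TEST_RE.findall(text):
            answer = answer.strip().strip('"').lower()
            if len(answer.split()) > max_answer_words:
                continue                          # drop out-of-distribution answers
            key = (caption.lower(), answer)
            if key not in seen:                   # deduplicate
                seen.add(key)
                tests.append((caption, answer))
    return tests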

D.2 In-Context Examples

We prompt the LLM with the system prompt presented below, as well as the in-context examples presented in Listings 6 and 7 for VQA and ITM, respectively.

You are a skilled AI assistant specialized in generating test cases for programs that respond to queries about images.

D.3 Unit Test Candidate Generation

We experiment with two prompting methodologies for unit test generation: Query-Only and Query+Implementation. The former takes into account only the user query to generate the unit tests, while the latter also takes into account each generated program. In the Query+Implementation setting we prompt the unit test generator in the same way, but additionally include example implementations and the current implementation, as shown in Listing 8.

D.4 Image Generation

To generate the images we use the diffusers library and prompt each of the models with generation hyperparameters guidance_scale=16.0 and num_inference_steps=50. If a generated image is flagged as NSFW, we increment the seed by 1 and regenerate, up to 10 times. Effectively, all unit tests receive a corresponding image. We use the following implementations: CompVis/stable-diffusion-v1-4 for SDv1.4, longlian/lmd_plus for LM Guided Diffusion, and stabilityai/stable-diffusion-xl-base-1.0 for SDXL.
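The generation loop with the seed-bump retry can be sketched as below for SDv1.4; the other checkpoints are loaded analogously through diffusers, and the retry bookkeeping is illustrative.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def generate_image(caption, seed=0, max_retries=10):
    for attempt in range(max_retries):
        gen = torch.Generator(device="cuda").manual_seed(seed + attempt)  # bump the seed on retry
        out = pipe(caption, guidance_scale=16.0, num_inference_steps=50, generator=gen)
        flagged = out.nsfw_content_detected and out.nsfw_content_detected[0]
        if not flagged:
            return out.images[0]
    return out.images[0]   # in practice every unit test ends up with an image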

D.4.1 LM Grounded Diffusion

To generate the bounding boxes and phrases for LM Grounded Diffusion we prompt meta-llama/Meta-Llama-3-8B-Instruct, executed via VLLM with the following generation parameters: temperature=1.0, top_p=0.9, top_k=0.0, max_new_tokens=320, and num_beams=1. We return 5 candidate sequences because the extracted phrases are sometimes empty, which leads to failures in image generation. We present the prompt and in-context examples used for this step in Code LABEL:code:lm_grounded.

Appendix E Strategies for Visual Unit Test Generation

E.1 Unit Test Sampler σ

Figure 10 illustrates the impact of different sampling strategies while varying the number of unit tests and program configurations. Our results indicate that ‘Coverage by Answer then Input’ consistently outperforms other methods. To gain deeper insights, we categorize the questions into three groups: Spatial, Attribute, and Other. For GQA, we classify any question group containing Attr as Attribute and those mentioning location or position as Spatial. Figure 11 presents the average performance across scenarios with at least five unit tests and three program configurations. Notably, the ‘Coverage by Answer then Input’ strategy emerges as the most effective for questions in the Attribute category.

Refer to caption
Figure 10: Effect of sampling methods on performance across varying numbers of unit tests and program configurations.
Refer to caption
Figure 11: Performance of sampling methods across question categories. Results are averaged over scenarios with at least five unit tests and three program configurations.

E.2 Image Generator M

Figure 12 shows the impact of various diffusion models across different numbers of unit tests and program configurations. Our analysis reveals that LM-Guided diffusion consistently outperforms other methods, particularly in scenarios with more programs, where the likelihood of finding a suitable program for execution is higher. To provide a deeper understanding, Figure 13 illustrates the average performance across scenarios involving at least three unit tests and two program configurations, focusing on the question categories introduced in the previous subsection. Notably, LM-Guided diffusion proves most effective for questions in the Spatial category, highlighting the advantages of more controllable generation in achieving higher spatial fidelity.

Refer to caption
Figure 12: Effect of diffusion model on performance across varying numbers of unit tests and program configurations.
Refer to caption
Figure 13: Performance of different diffusion models across question categories. Results are averaged over scenarios with at least three unit tests and two program configurations.

E.3 Scoring function h

Figure 14 highlights the impact of error penalties across varying configurations of unit tests and programs. While their effect becomes negligible in higher-resource configurations with more programs and unit tests, error penalties prove beneficial in lower-resource settings. In these scenarios, they help prioritize the selection of executable programs, thereby improving performance. Notably, runtime error penalties are more impactful for GQA, whereas compilation error penalties play a larger role in WinoGround. This difference likely stems from the higher complexity of WinoGround programs, which are more prone to compilation errors.

Refer to caption
Figure 14: Effect of error penalties on accuracy.

E.4 Aggregate Scorer H

Figure 15 illustrates the impact of various aggregator functions on accuracy. Among these, mean score aggregation consistently outperforms other methods, particularly in configurations with a higher number of programs. In the case of WinoGround, however, max aggregation also performs competitively, occasionally surpassing mean aggregation. This is likely due to the binary nature of the answers in WinoGround and the increased likelihood of selecting programs that are correct for incorrect reasons.

Refer to caption
Figure 15: Effect of aggregator function on accuracy.
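To make the scoring and aggregation choices discussed in E.3 and E.4 concrete, the sketch below shows one plausible formulation; the actual scorer h and aggregator H, including the exact penalty values, are defined in Section 3 and are not reproduced here.

def unit_test_score(program_output, expected, error=None,
                    runtime_penalty=-0.1, compile_penalty=-0.1):
    # Illustrative per-test score: 1 for a matching answer, 0 otherwise,
    # with placeholder penalties for compilation and runtime errors.
    if error == "compilation":
        return compile_penalty
    if error == "runtime":
        return runtime_penalty
    return 1.0 if str(program_output).strip().lower() == expected.lower() else 0.0

def aggregate(scores, how="mean"):
    # Mean aggregation is the default; max is the alternative compared in Figure 15.
    return sum(scores) / len(scores) if how == "mean" else max(scores)

# The program with the highest aggregate score over its unit tests is selected.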

Appendix F Visual Unit Test Utilization Methods

F.1 Best Program Selection

Table 7 shows additional results on best program selection with a varying number of programs.

VQA Image-Text Matching
LLM # Prog # UT GQA Winoground SugarCREPE Avg.
Base Setup
gpt-4o-mini 1 0 42.03±1.21 44.98±0.75 38.75±0.47 41.92±0.81
CodeLlama-7B 1 0 35.99±2.94 38.83±0.45 30.54±0.99 35.12±1.46
CodeGemma-7B 1 0 41.83±2.26 39.60±1.38 42.56±1.52 41.33±1.72
Most Common Answer Setup
CodeLlama-7B 2 0 27.76±0.41 36.19±0.66 32.02±2.25 31.99±1.11
CodeLlama-7B 3 0 35.99±0.70 42.40±0.85 37.26±2.70 38.55±1.42
CodeLlama-7B 4 0 38.71±1.61 42.12±0.60 39.17±2.01 40.00±1.41
CodeLlama-7B 5 0 42.50±1.50 45.85±0.77 41.67±1.79 43.34±1.35
CodeGemma-7B 2 0 31.87±0.80 33.04±0.67 36.37±1.62 33.76±1.03
CodeGemma-7B 3 0 40.31±1.00 40.50±1.33 44.58±0.55 41.80±0.96
CodeGemma-7B 4 0 40.44±0.53 43.06±1.89 44.46±1.17 42.66±1.20
CodeGemma-7B 5 0 43.89±0.98 46.04±1.48 46.67±1.69 45.53±1.38
ViUniT Setup (Ours)
CodeLlama-7B 2 5 41.90±1.74 46.65±1.63 40.24±0.82 42.93±1.40
CodeLlama-7B 3 5 45.68±0.94 48.54±0.37 43.93±1.09 46.05±0.80
CodeLlama-7B 4 5 49.07±2.39 50.17±0.54 45.65±1.22 48.30±1.38
CodeLlama-7B 5 5 49.27±1.13 49.73±0.73 47.02±1.19 48.67±1.02
CodeGemma-7B 2 5 44.02±0.72 49.27±0.57 46.73±2.30 46.67±1.20
CodeGemma-7B 3 5 46.08±0.41 51.17±1.98 48.93±1.86 48.73±1.42
CodeGemma-7B 4 5 47.88±1.36 52.25±1.35 50.83±1.32 50.32±1.34
CodeGemma-7B 5 5 48.01±1.05 51.92±0.90 51.85±2.16 50.59±1.37
Table 7: Accuracy on Best Program Selection with varying number of programs. Bold is best.

F.2 Answer Refusal

Figure 16 shows additional statistics on answer refusal, in particular the accuracy of selecting programs that will provide the final answer and the programs that succeed on the unit tests at different thresholds.

Refer to caption
Figure 16: Accuracy and program pass rate for different threshold values for answer refusal.
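Answer refusal itself reduces to a threshold check on the aggregate unit test score; the sketch below is a minimal illustration, and the threshold value is a placeholder (Figure 16 sweeps over it).

def select_or_refuse(programs, scores, threshold=0.7):
    # programs: candidate programs; scores: their aggregate unit test scores.
    best = max(range(len(programs)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None            # refuse to answer (or fall back, see Appendix G)
    return programs[best]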

F.3 Re-prompting

F.3.1 Implementation Details

We consider an application of the unit tests in which new candidate programs are generated when the current program's unit test score falls below a threshold. To do so, we keep the same hyperparameters for the program generator, but adapt the prompt to include the outputs of the unit tests and use suitable in-context examples, as shown in Listings 4 and 5 for VQA and ITM, respectively.

Error Reprompting Baseline. We employ the same model and hyperparameters as [Uncaptioned image] reprompting, but instead adapt the prompt to take into account the error messages rather than the unit tests, as shown in Codes LABEL:code:error_reprompting_vqa and LABEL:code:error_reprompting_itm for VQA and ITM respectively.
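Constructing the re-prompt amounts to filling the placeholders of Listings 4 and 5 with the query, the failing program, and the formatted unit test results; the helper below is an illustrative sketch with assumed variable names.

import string

def format_test_cases(tests, outputs):
    """tests: list of (caption, expected answer); outputs: program outputs or error strings."""
    lines = []
    for label, (caption, expected), output in zip(string.ascii_uppercase, tests, outputs):
        lines.append(f"Test {label}")
        lines.append(f'Image Content: "{caption}"')
        lines.append(f'Ground Truth Answer: "{expected}"')
        lines.append(f'Program Output: "{output}"')
    return "\n".join(lines)

def build_reprompt(template, query, bad_program, tests, outputs):
    # template: the prompt of Listing 4 or 5, with its INSERT_* placeholders.
    return (template
            .replace("INSERT_QUERY_HERE", query)
            .replace("INSERT_CODE_HERE", bad_program)
            .replace("INSERT_UNIT_TEST_OUTPUTS_HERE", format_test_cases(tests, outputs)))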

F.3.2 Additional Results

Table 8 presents the results of an additional reprompting iteration, highlighting that while [Uncaptioned image]  continues to achieve higher performance overall, there is a slight drop in accuracy compared to the previous iteration. This decline can be attributed to its attempts to refine programs that may already produce correct answers for the wrong reasons. Such corrections can inadvertently cause shifts in the generated answers, leading to decreased accuracy despite the method’s focus on improving program fidelity.

VQA Image-Text Matching
LLM Iter. # Prog # UT GQA Winoground SugarCREPE Avg.
Base Setup (Iteration = 0)
CodeLlama-7B 0 1 0 35.99±2.94 38.83±0.45 30.54±0.99 35.12±1.46
CodeGemma-7B 0 1 0 41.83±2.26 39.60±1.38 42.56±1.52 41.33±1.72
Error Reprompting
CodeLlama-7B 1 1 0 37.92±2.68 42.46±0.57 33.21±0.64 37.86±1.30
CodeLlama-7B 2 1 0 38.78±2.22 44.58±0.44 37.08±1.08 40.15±1.25
CodeGemma-7B 1 1 0 42.63±2.42 42.42±1.91 44.52±1.05 42.63±2.42
CodeGemma-7B 2 1 0 42.90±2.65 43.08±1.73 45.30±0.92 42.90±2.65
ViUniT Reprompting θ = 0.7 (Ours)
CodeLlama-7B 1 1 5 46.68±2.52 51.85±0.40 47.68±2.17 48.74±1.69
CodeLlama-7B 2 1 5 46.95±1.33 52.04±0.83 48.04±1.64 49.01±1.26
CodeGemma-7B 1 1 5 45.75±0.30 48.19±2.28 48.21±1.12 47.38±1.23
CodeGemma-7B 2 1 5 44.42±1.00 49.25±2.66 48.81±1.19 47.49±1.62
Table 8: Accuracy of different re-prompting methods with an additional iteration. Bold is best.

F.4 Reward Design for Reinforcement Learning

F.4.1 Implementation Details

Table 9 contains additional hyperparameters used for training. Each RL epoch requires about 30 minutes with the correctness reward and about 90 minutes with the [Uncaptioned image] reward, since the latter requires executing unit tests.

Parameter Value
warmup_ratio 0.1
max_grad_norm 0.3
lr_scheduler_type linear
learning_rate 2e-4
lora_config.r 16
lora_config.lora_alpha 32
lora_config.lora_dropout 0.05
lora_config.bias none
lora_config.target_modules k_proj, v_proj, q_proj, o_proj
Table 9: RL training hyperparameters.
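For reference, the LoRA settings of Table 9 correspond to a PEFT configuration along the lines of the sketch below; the task_type value is an assumption and the surrounding RL training loop is omitted.

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["k_proj", "v_proj", "q_proj", "o_proj"],
    task_type="CAUSAL_LM",  # assumption: causal LM policy
)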

F.4.2 Additional Analysis

Table 10 highlights the reduced error rates, measured as the proportion of programs leading to exceptions, achieved using the [Uncaptioned image] reward. Additionally, Table 11 presents the results of cross-task and cross-dataset generalization for policies trained on GQA, following the approach of [23]. For VQAv2 [11], we sample 10 questions for each of the 50 most common answers from the validation split of the compositional subset curated by [43], similar to [23]. For OKVQA [35], we sample 10 questions per question type, resulting in a total of 110 questions. The results indicate that while both reward types demonstrate strong generalization across tasks and datasets, the [Uncaptioned image] reward consistently delivers superior performance.

VQA Image-Text Matching
LLM # Prog # UT GQA Winoground SugarCREPE Avg.
Supervised Correctness Reward
CodeLlama-7B 1 0 15.14±7.74 8.21±1.72 20.06±3.62 14.47±4.36
CodeGemma-7B 1 0 9.10±9.35 13.25±6.30 12.86±4.41 11.73±6.69
Unsupervised ViUniT Reward (Ours)
CodeLlama-7B 1 0 9.56±2.13 10.31±1.55 15.42±3.03 11.76±2.24
CodeGemma-7B 1 0 1.99±0.91 5.81±0.49 6.25±1.02 4.68±0.80
Table 10: Comparison of Error Rates in models trained with supervised correctness rewards versus unsupervised unit-test-based rewards. Lower is better. Bold is best.
X-Dataset Generalization X-Task Generalization
LLM # Prog # UT VQAv2 OK-VQA Winoground SugarCREPE
Base Setup
CodeLlama-7B 1 0 25.67±2.20 16.09±2.02 30.54±0.99 35.12±1.46
CodeGemma-7B 1 0 36.40±1.44 27.58±2.48 42.56±1.52 41.33±1.72
Supervised Correctness Reward
CodeLlama-7B 1 0 34.33±7.82 24.12±5.98 41.02±3.05 37.14±6.48
CodeGemma-7B 1 0 42.47±6.03 28.12±6.20 47.98±4.98 39.94±11.58
Unsupervised ViUniT Reward (Ours)
CodeLlama-7B 1 0 35.87±2.31 25.64±0.91 43.63±2.89 44.35±3.18
CodeGemma-7B 1 0 44.00±4.20 36.85±3.48 51.78±0.41 49.23±2.54
Table 11: GQA policy generalization across tasks and datasets

Appendix G End-to-End Fallback Methods

G.1 Implementation Details

G.1.1 VQA

For VQA we fall back to asking the query directly to Salesforce/blip2-flan-t5-xxl [28], loaded in 8-bit with BitsAndBytes, with a maximum batch size of 4 and generation hyperparameters length_penalty=-1, num_beams=5, max_length=10, min_length=1, do_sample=False, top_p=0.9, repetition_penalty=1.0, and temperature=1.

G.1.2 Image-Text-Matching

For image-text matching we fall back to openai/clip-vit-large-patch14-336 [39], predicting a positive match when the image-text similarity exceeds a 0.8 threshold and a negative match otherwise.
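The fallback can be sketched as below; we assume the 0.8 threshold is applied to a cosine similarity between normalized CLIP embeddings, which may differ from the exact scoring in the released code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

@torch.no_grad()
def itm_fallback(image: Image.Image, text: str, threshold: float = 0.8) -> str:
    inputs = clip_proc(text=[text], images=image, return_tensors="pt", padding=True)
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    sim = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return "yes" if sim >= threshold else "no"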

G.2 Results with Fallback Method on Exception

In this work, we report results without employing a fallback method on exceptions, treating such cases as failures to better assess the quality of programs generated by different methods. However, it is common in the literature to report accuracy with a fallback method applied on exceptions. In Table 12 we present the best program selection results using this fallback approach on error.

VQA Image-Text Matching
LLM # Prog # UT GQA Winoground SugarCREPE Avg.
Base Setup
gpt-4o-mini† 1 0 43.76±1.72 51.94±0.56 49.46±1.25 48.39±1.17
CodeLlama-7B† 1 0 44.75±2.01 51.65±1.09 48.57±0.82 48.32±1.31
CodeGemma-7B† 1 0 44.82±2.30 47.23±2.26 50.18±0.71 47.41±1.76
Most Common Answer Setup
CodeLlama-7B† 5 0 49.07±2.79 51.29±0.87 46.79±1.29 49.05±1.65
CodeGemma-7B† 5 0 46.61±1.24 49.10±1.32 49.17±1.52 48.29±1.36
ViUniT Setup (Ours)
CodeLlama-7B† 5 5 49.27±1.33 49.73±0.73 47.02±1.19 48.67±1.08
CodeGemma-7B† 5 5 48.14±1.02 51.92±0.90 51.85±2.16 50.63±1.36
Table 12: Accuracy on Best Program Selection using fallback method on exception (indicated by †). Bold is best.

Appendix H Human Evaluation

This section presents details of the human evaluations of unit test quality and program correctness. We used Google Forms to conduct the evaluations.

H.1 Unit Test Evaluation

To assess the quality of unit tests we randomly sample 20 examples from each of the three datasets, each corresponding to 5 unit tests, resulting in a total of 300 unit tests for evaluation. The unit tests were judged by three independent annotators, who were asked "Is the answer <answer> correct given the image?", where <answer> was populated with the unit test's expected answer, and who responded with binary yes/no answers. Table 13 breaks down the results, showing that on average 75% of unit tests are correct. The annotators then optionally annotated the source of failure by selecting from “Missing Object”, “Spatial Error”, “Incomplete object”, “Color Mismatch”, or “Other”. Figure 17 shows the breakdown by error type, highlighting “Missing Object” as the most common source of error.

GQA WinoGround SugarCREPE Avg.
Acc. κ Acc. κ Acc. κ Acc. κ
68.00 0.39 75.00 0.70 82.00 0.67 75.00 0.58
Table 13: Human Evaluation of Unit Test Quality. Accuracy corresponds to the fraction of unit tests judged accurate, and κ is the mean Cohen's kappa across annotators.
Refer to caption
Figure 17: Human Evaluation of Unit Test Quality. Bars show the average number of times annotators selected a source of error.

H.2 Program Correctness Evaluation

To assess the improvements in program quality from applying [Uncaptioned image], we conduct a human evaluation rating GQA programs generated by the Base Setup against programs selected from 5 candidate programs using 5 unit tests. Two annotators with 3+ years of Python experience graded programs using the following scheme: “Correct: The code accurately and fully answers the query.” (0), “Partially Correct: The code answers the query but has some issues.” (1), “Incorrect: The code does not answer the query correctly.” (2), and “Irrelevant: The code is unrelated to the query.” (3). In addition, they were optionally asked to select the source of error from “Missing Condition”, “Incorrect Logic”, “Irrelevant to the query”, “Wrong Conditions”, “Missing Checks (e.g. could get list index out of range)”, “Performance Issues”, or “Other”. Table 14 shows the breakdown of program correctness improvements using [Uncaptioned image], and Figure 18 shows the error types identified for each method. For [Uncaptioned image], the most common error type is “Missing Checks”, which mostly involves not checking array length before accessing indices and typically still leads to correct solutions with reasonable programs, whereas the main culprit for program incorrectness in the base setup is “Incorrect Logic”.

Base Setup ViUniT Setup (Ours)
Fully Correct (≤ 1) 77% 86%
Partially Correct (< 2) 86% 95%
Incorrect (≥ 2) 14% 5%
Irrelevant (> 2) 4% 0%
κ 0.24 0.30
κ_bin 0.59 0.40
Table 14: Human Evaluation of Program Correctness. Bold is best.
Refer to caption
Figure 18: Human Evaluation of Program Quality.

Appendix I Limitations and Social Ethics Impact

I.1 Limitations

While [Uncaptioned image] provides significant advancements in the logical correctness and robustness of visual programs, our framework has several limitations that present opportunities for future enhancement.

First, although [Uncaptioned image] improves program selection and execution by leveraging unit tests, it does not fully eliminate the issue of programs being correct for the wrong reasons, as shown by the human evaluation in Table 14. Our approach does not provide a formal guarantee of logical correctness, as it relies on automatically generated tests to evaluate candidate programs. Addressing this challenge opens avenues for integrating formal verification methods and more sophisticated testing strategies to further enhance program correctness.

Second, while we optimize for maximizing input and output coverage during unit test generation, it is possible that the generated tests do not fully capture the space of edge cases or subtle logical errors in complex programs. This limitation highlights the potential for future work to develop more comprehensive coverage metrics and testing methodologies, possibly incorporating code-line execution coverage or other verifiable metrics.

Third, the improved accuracy and robustness achieved by [Uncaptioned image], as seen in Table 1, come with an increase in computational effort. Generating candidate programs, sampling unit tests, and executing them on generated images introduce additional overhead. This trade-off between accuracy and efficiency presents an exciting challenge for future research to optimize the framework for real-time or resource-constrained applications, possibly through algorithmic improvements or efficient execution strategies.

Additionally, enhancing the explainability of program failures remains an area for further development. Providing clear and interpretable feedback when a program is rejected or not selected due to poor performance on unit tests can improve user trust and facilitate debugging. Future work could focus on combining unit test outputs to offer detailed explanations of program failures.

Finally, while [Uncaptioned image] has demonstrated effectiveness on VQA and ITM tasks, exploring its applicability to other domains or tasks involving different modalities or reasoning paradigms presents an opportunity to extend its impact. Adapting the framework to diverse domains can unlock new possibilities and broaden its utility.

Despite these limitations, the advancements introduced by [Uncaptioned image] lay a strong foundation for future innovations in visual programming. By addressing these challenges, we can further enhance the robustness, efficiency, and applicability of the framework.

I.2 Social Ethics Impact

[Uncaptioned image] enhances the robustness and correctness of visual programming, with applications in critical domains like autonomous driving, healthcare, and education. By reducing instances where programs are correct for the wrong reasons, it helps build more trustworthy AI systems. However, ethical considerations are crucial for its responsible deployment. First, [Uncaptioned image] relies on pre-trained models, which may propagate biases (e.g., gender, racial, or cultural). Future work should focus on integrating bias detection and correction into unit test generation to promote fairness. Second, computational demands may limit access for resource-constrained organizations. Advancing efficiency and optimization can broaden accessibility and foster inclusivity. Third, increased computational needs may raise energy consumption. Optimizing for energy efficiency and using renewable energy can reduce the environmental impact, while improved AI reliability could deliver long-term sustainability benefits. Finally, in sensitive domains such as healthcare or legal decision-making, while [Uncaptioned image] has the potential to enhance the correctness of visual programs, it is crucial to carefully communicate the framework's limitations and ensure rigorous validation and transparency. By proactively addressing ethical challenges and focusing on responsible development, we can maximize the positive societal impact of [Uncaptioned image], paving the way for more reliable, fair, and trustworthy AI systems.

Appendix J Qualitative Examples

We present two program selection examples in Figures 19 and 20.

Refer to caption
Figure 19: Program Selection Example
Refer to caption
Figure 20: Program Selection Example
Listing 1: API Prompt
import math
class ImagePatch:
pass
def __init__(
self, image, left=None, lower=None, right=None, upper=None, category=None
):
"""Initializes an ImagePatch object by cropping the image at the given
coordinates and stores the coordinates as attributes. If no coordinates are
provided, the image is left unmodified, and the coordinates are set to the
dimensions of the image.
Parameters
-------
image : array_like
An array-like of the original image.
left, lower, right, upper : int
An int describing the position of the (left/lower/right/upper) border of the
crop’s bounding box in the original image.
category : str
A string describing the name of the object in the image."""
# Rectangles are represented as 4-tuples, (x1, y1, x2, y2),
# with the upper left corner given first. The coordinate
# system is assumed to have its origin in the upper left corner, so
# upper must be less than lower and left must be less than right.
self.left = left if left is not None else 0
self.lower = lower if lower is not None else image.height
self.right = right if right is not None else image.width
self.upper = upper if upper is not None else 0
self.cropped_image = image[:, image.shape[1]-upper:image.shape[1]-lower, left:right]
self.horizontal_center = (self.left + self.right) / 2
self.vertical_center = (self.upper + self.lower) / 2
self.category = category
def from_bounding_box(cls, image, bounding_box):
"""Initializes an ImagePatch object by cropping the image at the given
coordinates and stores the coordinates as attributes.
Parameters
-------
image : array_like
An array-like of the original image.
bounding_box : dict
A dictionary like {"box": [left, lower, right, upper], "category": str}."""
pass
@property
def area(self):
"""
Returns the area of the bounding box.
Examples
--------
>>> # What color is the largest foo?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> foo_patches = image_patch.find("foo")
>>> foo_patches.sort(key=lambda x: x.area)
>>> largest_foo_patch = foo_patches[-1]
>>> return largest_foo_patch.simple_query("What is the color?")
"""
pass
def find(self, object_name):
"""Returns a list of ImagePatch objects matching object_name contained in the
crop if any are found.
Otherwise, returns an empty list.
Parameters
----------
object_name : str
the name of the object to be found
Returns
-------
List[ImagePatch]
a list of ImagePatch objects matching object_name contained in the crop
Examples
--------
>>> # return the foo
>>> def execute_command(image) -> List[ImagePatch]:
>>> image_patch = ImagePatch(image)
>>> foo_patches = image_patch.find("foo")
>>> return foo_patches
"""
pass
def exists(self, object_name):
"""Returns True if the object specified by object_name is found in the image,
and False otherwise.
Parameters
-------
object_name : str
A string describing the name of the object to be found in the image.
Examples
-------
>>> # Are there both foos and garply bars in the photo?
>>> def execute_command(image)->str:
>>> image_patch = ImagePatch(image)
>>> is_foo = image_patch.exists("foo")
>>> is_garply_bar = image_patch.exists("garply bar")
>>> return bool_to_yesno(is_foo and is_garply_bar)
"""
pass
def verify_property(self, object_name, visual_property):
"""Returns True if the object possesses the visual property, and False otherwise.
Differs from ’exists’ in that it presupposes the existence of the object
specified by object_name, instead checking whether the object possesses
the property.
Parameters
-------
object_name : str
A string describing the name of the object to be found in the image.
visual_property : str
String describing the simple visual property (e.g., color, shape, material)
to be checked.
Examples
-------
>>> # Do the letters have blue color?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> letters_patches = image_patch.find("letters")
>>> # Question assumes only one letter patch
>>> return bool_to_yesno(letters_patches[0].verify_property("letters", "blue"))
"""
pass
def simple_query(self, question):
"""Returns the answer to a basic question asked about the image.
If no question is provided, returns the answer to "What is this?".
The questions are about basic perception, and are not meant to be used for
complex reasoning or external knowledge.
Parameters
-------
question : str
A string describing the question to be asked.
Examples
-------
>>> # Which kind of baz is not fredding?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> baz_patches = image_patch.find("baz")
>>> for baz_patch in baz_patches:
>>> if not baz_patch.verify_property("baz", "fredding"):
>>> return baz_patch.simple_query("What is this baz?")
>>> # What color is the foo?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> foo_patches = image_patch.find("foo")
>>> foo_patch = foo_patches[0]
>>> return foo_patch.simple_query("What is the color?")
>>> # Is the second bar from the left quuxy?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> bar_patches = image_patch.find("bar")
>>> bar_patches.sort(key=lambda x: x.horizontal_center)
>>> bar_patch = bar_patches[1]
>>> return bar_patch.simple_query("Is the bar quuxy?")
"""
pass
def crop_left_of_bbox(self, left, lower, right, upper):
"""Returns an ImagePatch object representing the area to the left of the given
bounding box coordinates.
Parameters
----------
left, lower, right, upper : int
The coordinates of the bounding box.
Returns
-------
ImagePatch
An ImagePatch object representing the cropped area.
Examples
--------
>>> # Is the bar to the left of the foo quuxy?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> foo_patch = image_patch.find("foo")[0]
>>> left_of_foo_patch = image_patch.crop_left_of_bbox(
>>> foo_patch.left, foo_patch.lower, foo_patch.right, foo_patch.upper
>>> )
>>> return bool_to_yesno(left_of_foo_patch.verify_property("bar", "quuxy"))
"""
pass
def crop_right_of_bbox(self, left, lower, right, upper):
"""Returns an ImagePatch object representing the area to the right of the given
bounding box coordinates.
Parameters
----------
left, lower, right, upper : int
The coordinates of the bounding box.
Returns
-------
ImagePatch
An ImagePatch object representing the cropped area.
Examples
--------
>>> # Is the bar to the right of the foo quuxy?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> foo_patch = image_patch.find("foo")[0]
>>> right_of_foo_patch = image_patch.crop_right_of_bbox(
>>> foo_patch.left, foo_patch.lower, foo_patch.right, foo_patch.upper
>>> )
>>> return bool_to_yesno(right_of_foo_patch.verify_property("bar", "quuxy"))
"""
pass
def crop_below_bbox(self, left, lower, right, upper):
"""Returns an ImagePatch object representing the area below the given
bounding box coordinates.
Parameters
----------
left, lower, right, upper : int
The coordinates of the bounding box.
Returns
-------
ImagePatch
An ImagePatch object representing the cropped area.
Examples
--------
>>> # Is the bar below the foo quuxy?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> foo_patch = image_patch.find("foo")[0]
>>> below_foo_patch = image_patch.crop_below_bbox(
>>> foo_patch.left, foo_patch.lower, foo_patch.right, foo_patch.upper
>>> )
>>> return bool_to_yesno(below_foo_patch.verify_property("bar", "quuxy"))
"""
pass
def crop_above_bbox(self, left, lower, right, upper):
"""Returns an ImagePatch object representing the area above the given
bounding box coordinates.
Parameters
----------
left, lower, right, upper : int
The coordinates of the bounding box.
Returns
-------
ImagePatch
An ImagePatch object representing the cropped area.
Examples
--------
>>> # Is the bar above the foo quuxy?
>>> def execute_command(image) -> str:
>>> image_patch = ImagePatch(image)
>>> foo_patch = image_patch.find("foo")[0]
>>> above_foo_patch = image_patch.crop_above_bbox(
>>> foo_patch.left, foo_patch.lower, foo_patch.right, foo_patch.upper
>>> )
>>> return bool_to_yesno(above_foo_patch.verify_property("bar", "quuxy"))
"""
pass
def best_image_match(list_patches: List[ImagePatch], content: List[str], return_index=False) -> Union[ImagePatch, int]:
"""Returns the patch most likely to contain the content.
Parameters
----------
list_patches : List[ImagePatch]
content : List[str]
the object of interest
return_index : bool
if True, returns the index of the patch most likely to contain the object
Returns
-------
int
Patch most likely to contain the object
"""
return best_image_match(list_patches, content, return_index)
def bool_to_yesno(bool_answer: bool) -> str:
return "yes" if bool_answer else "no"
Write a function using Python and the ImagePatch class (above) that could be executed to provide an answer to the query.
Consider the following guidelines:
- Use base Python (comparison, sorting) for basic logical operations, left/right/up/down, math, etc.
# Examples of how to use the API
INSERT_CONTEXT_HERE
Query: INSERT_QUERY_HERE
Program:
Listing 2: ITM In-Context Examples
# Query: Verify image matches text="An airplane is flying in the sky, and birds are flying below it."
def execute_command(image) -> str:
image_patch = ImagePatch(image)
airplane_patches = image_patch.find("airplane")
bird_patches = image_patch.find("bird")
airplane_in_sky = any(
airplane_patch.vertical_center > image_patch.height * 0.6
for airplane_patch in airplane_patches
)
birds_below_airplane = any(
bird_patch.upper <= airplane_patch.lower
for bird_patch in bird_patches for airplane_patch in airplane_patches
)
return bool_to_yesno(airplane_in_sky and birds_below_airplane)
# Query: Verify image matches text="The bird is flying above the tree, and a cat is sitting under the tree."
def execute_command(image) -> str:
image_patch = ImagePatch(image)
bird_patches = image_patch.find("bird")
tree_patches = image_patch.find("tree")
cat_patches = image_patch.find("cat")
bird_above_tree = any(
bird_patch.lower >= tree_patch.upper and
abs(bird_patch.horizontal_center - tree_patch.horizontal_center) < 50
for bird_patch in bird_patches for tree_patch in tree_patches
)
cat_under_tree = any(
cat_patch.upper <= tree_patch.lower and
abs(cat_patch.horizontal_center - tree_patch.horizontal_center) < 50
for cat_patch in cat_patches for tree_patch in tree_patches
)
return bool_to_yesno(bird_above_tree and cat_under_tree)
# Query: Verify image matches text="The apple is on top of the book, and the pen is beside the book."
def execute_command(image) -> str:
image_patch = ImagePatch(image)
apple_patches = image_patch.find("apple")
book_patches = image_patch.find("book")
pen_patches = image_patch.find("pen")
apple_on_book = any(
apple_patch.lower >= book_patch.upper and
book_patch.left <= apple_patch.horizontal_center <= book_patch.right
for apple_patch in apple_patches for book_patch in book_patches
)
pen_beside_book = any(
abs(pen_patch.horizontal_center - book_patch.horizontal_center) < 50 and
abs(pen_patch.vertical_center - book_patch.vertical_center) < 100
for pen_patch in pen_patches for book_patch in book_patches
)
return bool_to_yesno(apple_on_book and pen_beside_book)
# Query: Verify image matches text="A man is riding a bicycle, and a dog is running beside him."
def execute_command(image) -> str:
image_patch = ImagePatch(image)
man_patches = image_patch.find("man")
bicycle_patches = image_patch.find("bicycle")
dog_patches = image_patch.find("dog")
man_on_bicycle = any(
man_patch.left <= bicycle_patch.right and man_patch.right >= bicycle_patch.left and
man_patch.lower <= bicycle_patch.upper and man_patch.upper >= bicycle_patch.lower
for man_patch in man_patches for bicycle_patch in bicycle_patches
)
dog_beside_man = any(
abs(dog_patch.horizontal_center - man_patch.horizontal_center) < 100 and
abs(dog_patch.vertical_center - man_patch.vertical_center) < 50
for dog_patch in dog_patches for man_patch in man_patches
)
return bool_to_yesno(man_on_bicycle and dog_beside_man)
Listing 3: VQA In-Context Examples
# Query: Is the vehicle in the top of the image?
def execute_command(image) -> str:
image_patch = ImagePatch(image)
# Assume there’s only one vehicle patch.
vehicle_patch = image_patch.find("vehicle")[0]
vehicle_in_top_half = vehicle_patch.vertical_center > image_patch.vertical_center
return bool_to_yesno(vehicle_in_top_half)
# Query: Are there trains or fences in this scene?
def execute_command(image) -> str:
image_patch = ImagePatch(image)
trains = image_patch.find("train")
fences = image_patch.find("fence")
has_trains_or_fences = len(trains) > 0 or len(fences) > 0
return bool_to_yesno(has_trains_or_fences)
# Query: Is the pillow in the top part or in the bottom of the picture?
def execute_command(image) -> str:
image_patch = ImagePatch(image)
pillow_patches = image_patch.find("pillow")
pillow_patch = pillow_patches[0]
pillow_in_top_half = pillow_patch.vertical_center > image_patch.vertical_center
if pillow_in_top_half:
return "top"
else:
return "bottom"
# Query: What color is the curtain that is to the right of the mirror?
def execute_command(image) -> str:
image_patch = ImagePatch(image)
mirror_patches = image_patch.find("mirror")
mirror_patch = mirror_patches[0]
right_of_mirror_patch = image_patch.crop_right_of_bbox(
mirror_patch.left, mirror_patch.lower, mirror_patch.right, mirror_patch.upper
)
return right_of_mirror_patch.simple_query("What color is the curtain?")
Listing 4: Reprompting with Unit Tests VQA
INSERT_IMAGE_PATCH_API
You are provided a Python program that answers a query about an image, with a set of tests with the corresponding outputs and expected responses.
Correct the Python program such that it passes the tests.
- Ensure the corrected program is different than the incorrect program provided.
Query: Is there a blue chair in the image?
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
blue_chair = image_patch.find("chair")
if not blue_chair:
return "No"
is_blue = any([chair.verify_property("blue") for chair in blue_chair])
return "Yes" if is_blue else "No"
Test Cases:
Test A
Image Content: "A room with a red chair"
Ground Truth Answer: "No"
Program Output: "Error: verify_property() missing 1 required positional argument: ’visual_property’"
Test B
Image Content: "A room with a blue chair under the window"
Ground Truth Answer: "Yes"
Program Output: "Error: verify_property() missing 1 required positional argument: ’visual_property’"
Test C
Image Content: "An empty room"
Ground Truth Answer: "No"
Program Output: "No"
Test D
Image Content: "A garden with a blue chair"
Ground Truth Answer: "Yes"
Program Output: "Error: verify_property() missing 1 required positional argument: ’visual_property’"
Test E
Image Content: "A room with several chairs, all red"
Ground Truth Answer: "No"
Program Output: "Error: verify_property() missing 1 required positional argument: ’visual_property’"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
chair_patches = image_patch.find("chair")
if not chair_patches:
return "No" # No chairs found
blue_chair_found = any(chair.verify_property("chair", "blue") for chair in chair_patches)
return "Yes" if blue_chair_found else "No"
Query: "Are there any flowers to the left of the house?"
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
house_patches = image_patch.find("house")
if not house_patches:
return "No house found"
left_of_house_patch = image_patch.crop_left_of_bbox(
house_patches.left, house_patches.lower, house_patches.right, house_patches.upper
) # Incorrect attribute access
return "Yes" if left_of_house_patch.exists("flower") else "No"
Test Cases:
Test A
Image Content: "An image of a garden without any buildings."
Ground Truth Answer: "No house found"
Program Output: "Error: ’list’ object has no attribute ’left’"
Test B
Image Content: "A house without a garden"
Ground Truth Answer: "No flowers found"
Program Output: "Error: ’list’ object has no attribute ’left’"
Test C
Image Content: "A house with many flowers around"
Ground Truth Answer: "Yes"
Program Output: "Error: ’list’ object has no attribute ’left’"
Test D
Image Content: "A house with flowers only on the right side"
Ground Truth Answer: "No"
Program Output: "Error: ’list’ object has no attribute ’left’"
Test E
Image Content: "An image with flowers but no house"
Ground Truth Answer: "No house found"
Program Output: "Error: ’list’ object has no attribute ’left’"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
house_patches = image_patch.find("house")
if not house_patches:
return "No house found"
for house_patch in house_patches:
left_of_house_patch = image_patch.crop_left_of_bbox(
house_patch.left, house_patch.lower, house_patch.right, house_patch.upper
)
flowers_found = left_of_house_patch.find("flower")
if flowers_found:
return "Yes"
return "No"
Query: Who wears a green shirt?
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
people_patches = image_patch.find("person")
if not people_patches:
return "No one"
person_wearing_green_shirt = None
for index, person_patch in enumerate(people_patches):
green_patches = person_patch.find("green")
if green_patches:
person_wearing_green_shirt = index
break
if person_wearing_green_shirt == None:
return "No one"
else:
return people_patches[person_wearing_green_shirt].simple_query("Who is this?")
Test Cases:
Test A
Image Content: "An image of an empty room"
Ground Truth Answer: "No one"
Program Output: "No one"
Test B
Image Content: "A young girl wearing a green dress and a boy wearing a blue shirt."
Ground Truth Answer: "No one"
Program Output: "girl"
Test C
Image Content: "A man wearing a red shirt and a woman wearing a green shirt."
Ground Truth Answer: "woman"
Program Output: "woman"
Test D
Image Content: "A boy wearing a green shirt."
Ground Truth Answer: "boy"
Program Output: "boy"
Test E
Image Content: "Two people wearing green shirts: a man and a woman"
Ground Truth Answer: "man and woman"
Program Output: "man"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
people_patches = image_patch.find("person")
if not people_patches:
return "No people found"
people_wearing_green_shirts = []
for index, person_patch in enumerate(people_patches):
if person_patch.verify_property("clothing", "shirt") and person_patch.verify_property("color", "green"):
people_wearing_green_shirts.append(index)
if not people_wearing_green_shirts:
return "No one"
wearing_green_shirts = " and ".join([people_patches[i].simple_query("Who is this?") for i in people_wearing_green_shirts])
return wearing_green_shirts
Query: "Is the blue ball inside the box?"
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
ball_patches = image_patch.find("ball")
if not ball_patches:
return "No"
blue_ball = [ball for ball in ball_patches if ball.verify_property("color", "blue")]
if not blue_ball:
return "No" # No blue ball found
box_patches = image_patch.find("box")
if not box_patches:
return "No"
return "Yes"
Test Cases:
Test A
Image Content: "A blue ball is outside a box"
Ground Truth Answer: "No"
Program Output: "Yes"
Test B
Image Content: "A red ball is inside a box"
Ground Truth Answer: "No"
Program Output: "No"
Test C
Image Content: "A blue ball is inside a box"
Ground Truth Answer: "Yes"
Program Output: "Yes"
Test D
Image Content: "No balls or boxes in the image"
Ground Truth Answer: "No"
Program Output: "No"
Test E
Image Content: "Multiple blue balls, all outside boxes"
Ground Truth Answer: "No"
Program Output: "Yes"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
ball_patches = image_patch.find("ball")
if not ball_patches:
return "No" # No ball found
blue_ball = [ball for ball in ball_patches if ball.verify_property("color", "blue")]
if not blue_ball:
return "No" # No blue ball found
box_patches = image_patch.find("box")
if not box_patches:
return "No" # No box found
blue_ball_patch = blue_ball[0]
for box_patch in box_patches:
if (box_patch.left <= blue_ball_patch.left and
box_patch.right >= blue_ball_patch.right and
box_patch.upper <= blue_ball_patch.upper and
box_patch.lower >= blue_ball_patch.lower):
return "Yes"
return "No"
Query: INSERT_QUERY_HERE
Incorrect Program:
INSERT_CODE_HERE
Test Cases:
INSERT_UNIT_TEST_OUTPUTS_HERE
Corrected Program:
Listing 5: Reprompting with Unit Tests ITM
INSERT_IMAGE_PATCH_API
You are provided a Python program that answers a query about an image, with a set of tests with the corresponding outputs and expected responses.
Correct the Python program such that it passes the tests.
- Ensure the corrected program is different than the incorrect program provided.
Query: "Verify image matches text="An airplane is flying in the sky, and birds are flying below it.""
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
airplane = image_patch.find("airplane")
birds = image_patch.find("birds")
if not airplane or not birds:
return "No"
if airplane[0].vertical_center >= birds[0].vertical_center:
return "Yes"
return "No"
Test Cases:
Test A
Image Content: "An airplane flying high in the sky with birds below it."
Ground Truth Answer: "Yes"
Program Output: "Yes"
Test B
Image Content: "Birds are flying above and below an airplane in the sky."
Ground Truth Answer: "No"
Program Output: "Yes"
Test C
Image Content: "An airplane and birds flying side by side."
Ground Truth Answer: "No"
Program Output: "Yes"
Test D
Image Content: "Only an airplane is flying in the sky."
Ground Truth Answer: "No"
Program Output: "No"
Test E
Image Content: "Birds flying in the sky with no airplane present."
Ground Truth Answer: "No"
Program Output: "No"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
airplane_patches = image_patch.find("airplane")
bird_patches = image_patch.find("bird")
if not airplane_patches or not bird_patches:
return "No"
airplane = airplane_patches[0]
birds_below = all(bird.vertical_center > airplane.vertical_center for bird in bird_patches)
return "Yes" if birds_below else "No"
Query: "Verify image matches text="The bird is flying above the tree, and a cat is sitting under the tree.""
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
tree = image_patch.find("tree")
bird = image_patch.find("bird")
cat = image_patch.find("cat")
if not tree or not bird or not cat:
return "No"
if bird[0].vertical_center < tree[0].vertical_center and cat[0].vertical_center > tree[0].vertical_center:
return "Yes"
return "No"
Test Cases:
Test A
Image Content: "A bird flying above a tree and a cat under the tree."
Ground Truth Answer: "Yes"
Program Output: "Yes"
Test B
Image Content: "A cat sitting above the tree and a bird flying below it."
Ground Truth Answer: "No"
Program Output: "Yes"
Test C
Image Content: "A bird sitting in the tree with no cat around."
Ground Truth Answer: "No"
Program Output: "No"
Test D
Image Content: "A cat climbing the tree while a bird flies overhead."
Ground Truth Answer: "No"
Program Output: "Yes"
Test E
Image Content: "A bird flying above a tree with a dog under the tree."
Ground Truth Answer: "No"
Program Output: "No"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
tree_patches = image_patch.find("tree")
bird_patches = image_patch.find("bird")
cat_patches = image_patch.find("cat")
if not tree_patches or not bird_patches or not cat_patches:
return "No"
tree = tree_patches[0]
bird_above = all(bird.vertical_center < tree.vertical_center for bird in bird_patches)
cat_below = all(cat.vertical_center > tree.vertical_center for cat in cat_patches)
return "Yes" if bird_above and cat_below else "No"
Query: "Verify image matches text="A car is parked near a tree, and a bird is sitting on the tree.""
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
car = image_patch.find("car")
tree = image_patch.find("tree")
bird = image_patch.find("bird")
if not car or not tree or not bird:
return "No"
if car.horizontal_center - tree.horizontal_center < 100 and bird.vertical_center < tree.vertical_center:
return "Yes"
return "No"
Test Cases:
Test A
Image Content: "A car parked near a tree with a bird sitting on it."
Ground Truth Answer: "Yes"
Program Output: AttributeError: ’list’ object has no attribute ’horizontal_center’
Test B
Image Content: "A car far from a tree with a bird on the ground."
Ground Truth Answer: "No"
Program Output: AttributeError: ’list’ object has no attribute ’horizontal_center’
Test C
Image Content: "A tree with a bird on it but no car nearby."
Ground Truth Answer: "No"
Program Output: "No"
Test D
Image Content: "A car parked near a tree with no bird in sight."
Ground Truth Answer: "No"
Program Output: AttributeError: ’list’ object has no attribute ’horizontal_center’
Test E
Image Content: "A car and a bird but no tree present."
Ground Truth Answer: "No"
Program Output: AttributeError: ’list’ object has no attribute ’horizontal_center’
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
car_patches = image_patch.find("car")
tree_patches = image_patch.find("tree")
bird_patches = image_patch.find("bird")
if not car_patches or not tree_patches or not bird_patches:
return "No"
car = car_patches[0]
tree = tree_patches[0]
bird = bird_patches[0]
car_near_tree = abs(car.horizontal_center - tree.horizontal_center) < 100
bird_on_tree = bird.vertical_center < tree.vertical_center
return "Yes" if car_near_tree and bird_on_tree else "No"
Query: "Verify image matches text="A man is holding a red balloon, and a child is reaching up to grab it.""
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
man = image_patch.find("man")
balloon = image_patch.find("balloon")
child = image_patch.find("child")
if not man or not balloon or not child:
return "No"
if balloon[0].verify_property("red") and child[0].vertical_center < balloon[0].vertical_center:
return "Yes"
return "No"
Test Cases:
Test A
Image Content: "A man holding a red balloon, with a child reaching up."
Ground Truth Answer: "Yes"
Program Output: TypeError: verify_property() missing 1 required positional argument: ’visual_property’
Test B
Image Content: "A man holding a blue balloon, with a child below him."
Ground Truth Answer: "No"
Program Output: TypeError: verify_property() missing 1 required positional argument: ’visual_property’
Test C
Image Content: "A man holding a flower, with a child next to him."
Ground Truth Answer: "No"
Program Output: "No"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
man_patches = image_patch.find("man")
balloon_patches = image_patch.find("balloon")
child_patches = image_patch.find("child")
if not man_patches or not balloon_patches or not child_patches:
return "No"
balloon = balloon_patches[0]
is_red_balloon = balloon.verify_property("balloon", "red")
child_below_balloon = all(child.vertical_center < balloon.vertical_center for child in child_patches)
return "Yes" if is_red_balloon and child_below_balloon else "No"
Query: "Verify image matches text="A cat is sitting on the table, and a book is lying beside it.""
Incorrect Program:
def execute_command(image):
image_patch = ImagePatch(image)
cat = image_patch.find("cat")
book = image_patch.find("book")
if not cat or not book:
return "No"
if abs(book[0].horizontal_center - cat[0].horizontal_center) < 50:
return "Yes"
return "No"
Test Cases:
Test A
Image Content: "A cat sitting on the table with a book beside it."
Ground Truth Answer: "Yes"
Program Output: "Yes"
Test B
Image Content: "A cat sitting on the floor with a book beside it."
Ground Truth Answer: "No"
Program Output: "Yes"
Test C
Image Content: "A cat sitting on the table with no book around."
Ground Truth Answer: "No"
Program Output: "No"
Test D
Image Content: "A book lying on the table with no cat in sight."
Ground Truth Answer: "No"
Program Output: "No"
Test E
Image Content: "A cat sitting on the table with a book on the floor."
Ground Truth Answer: "No"
Program Output: "Yes"
Corrected Program:
def execute_command(image):
image_patch = ImagePatch(image)
cat_patches = image_patch.find("cat")
book_patches = image_patch.find("book")
table_patches = image_patch.find("table")
if not cat_patches or not book_patches or not table_patches:
return "No"
cat = cat_patches[0]
book = book_patches[0]
table = table_patches[0]
is_cat_on_table = cat.vertical_center < table.vertical_center and abs(cat.horizontal_center - table.horizontal_center) < 50
is_book_beside_cat = abs(book.horizontal_center - cat.horizontal_center) < 50
return "Yes" if is_cat_on_table and is_book_beside_cat else "No"
Query: INSERT_QUERY_HERE
Incorrect Program:
INSERT_CODE_HERE
Test Cases:
INSERT_UNIT_TEST_OUTPUTS_HERE
Corrected Program:
Listing 6: VQA Unit Test Generation In Context Examples
Query: Is there a cat or dog in the image?
Tests:
1. Image Caption: "A grey tabby cat peacefully napping on a plush sofa" Answer: yes
2. Image Caption: "A lively golden retriever bounding across a grassy field in the park" Answer: yes
3. Image Caption: "Twin Siamese cats playfully swatting at a bright yellow ball" Answer: yes
4. Image Caption: "A cluster of wild horses trotting along the sandy shores of a sunlit beach" Answer: no
5. Image Caption: "An orange cat and a black Labrador playfully tugging on a rope toy" Answer: yes
6. Image Caption: "A modern living room featuring sleek furniture and devoid of any pets" Answer: no
Query: Is there a red truck or bus in the image?
Tests:
1. Image Caption: "A vibrant red Ford pickup parked beside a country road" Answer: yes
2. Image Caption: "A red double-decker bus navigating through a busy downtown street" Answer: yes
3. Image Caption: "A large blue semi-truck cruising down an interstate highway" Answer: no
4. Image Caption: "A quiet suburban street devoid of any large vehicles like buses or trucks" Answer: no
5. Image Caption: "A shiny red Ferrari speeding on a professional race track" Answer: no
6. Image Caption: "An array of red delivery trucks lined up in a distribution center parking lot" Answer: yes
7. Image Caption: "Several bright yellow school buses parked in a row at a local school" Answer: no
Query: What color is the largest car in the image?
Tests:
1. Image Caption: "A large blue Ford pickup truck driving on a busy highway" Answer: blue
2. Image Caption: "A city street empty of any large vehicles like buses or trucks" Answer: no answer
3. Image Caption: "A row of green food trucks serving lunch in an urban park" Answer: green
4. Image Caption: "A scene with a green public bus next to a smaller blue pickup at an intersection" Answer: green
Query: Is the vase to the left or right of the center?
Tests:
1. Image Caption: "A delicate porcelain vase positioned on the right end of a mahogany dining table" Answer: right
2. Image Caption: "A tall glass vase sitting on the left side of a neatly made bed in a sunlit room" Answer: left
3. Image Caption: "A ceramic vase centrally placed on a round table surrounded by chairs" Answer: center
Query: What is the highest object in the image?
Tests:
1. Image Caption: "A massive skyscraper dominating the skyline among lower city buildings" Answer: skyscraper
2. Image Caption: "A lone oak tree surpassing the height of the cottage it stands next to" Answer: tree
3. Image Caption: "Colorful balloons drifting above the treetops in a clear sky" Answer: balloons
4. Image Caption: "A commercial jet flying high above the city’s tallest skyscrapers" Answer: plane
5. Image Caption: "A majestic eagle soaring high above a vast canyon landscape" Answer: eagle
6. Image Caption: "A figure standing on the peak of a grassy hill under a blue sky" Answer: person
Query: INSERT_QUERY_HERE
Tests:
Listing 7: ITM Unit Test Generation In Context Examples
Query: Is the drawing of a tree on the hill, and a river that flows at the bottom of the hill?
Tests:
1. Image Caption: "A solitary tree stands atop a gentle hill, with a flowing river winding below it." Answer: yes
2. Image Caption: "A tree on a grassy hill under a clear sky." Answer: no
3. Image Caption: "A river meandering through a dense forest of tall trees." Answer: no
4. Image Caption: "A panoramic view of rolling hills in the desert, with a river at the bottom." Answer: no
5. Image Caption: "A vast plain with a river running through fields of wildflowers." Answer: no
6. Image Caption: "A hill with multiple trees and a river flowing nearby." Answer: yes
Query: Is the drawing of an airplane flying in the sky, and birds flying below it?
Tests:
1. Image Caption: "An airplane soars through the sky, with a flock of birds flying beneath it." Answer: yes
2. Image Caption: "Birds flying over a tranquil lake under a clear sky." Answer: no
3. Image Caption: "An airplane performing aerobatic maneuvers, with birds flying above it." Answer: no
4. Image Caption: "An airplane floating in the sea with birds flying above it." Answer: Yes
5. Image Caption: "An airplane in a clear sky" Answer: no
Query: Is the drawing of a girl holding an umbrella in the rain?
Tests:
1. Image Caption: "A girl holding an umbrella walks through a rainy street." Answer: yes
2. Image Caption: "A girl holds an umbrella under a bright sun in the park." Answer: no
3. Image Caption: "A girl stands in the rain wearing a colorful raincoat and holding flowers." Answer: no
4. Image Caption: "A girl walks her dog while holding an umbrella on a rainy day." Answer: yes
Query: Is the drawing of a person sitting at a desk with a computer monitor in front of them?
Tests:
1. Image Caption: "A person sitting at a desk, writing in a notebook with a lamp beside them." Answer: no
2. Image Caption: "Someone sitting at a desk cluttered with papers and a computer monitor." Answer: yes
3. Image Caption: "Someone sitting at a desk cluttered with papers and a computer monitor." Answer: yes
3. Image Caption: "A person with a big computer screen in the background" Answer: no
Query: Is the drawing of a man riding a bicycle, and a dog running beside him?
Tests:
1. Image Caption: "A man cycling alone on a mountain trail surrounded by trees." Answer: no
2. Image Caption: "A man rides a bicycle along the beach, his dog running beside him." Answer: yes
3. Image Caption: "A bicycle and a dog" Answer: no
4. Image Caption: "A dog next to a car" Answer: no
5. Image Caption: "A man walking his dog" Answer: no
6. Image Caption: "A man rides a bicycle down a sunny street with a dog running beside him." Answer: yes
Query: INSERT_QUERY_HERE
Tests:
Listing 8: VQA Unit Test Generation with Implementation In-Context Examples
# Query: Is there a cat or dog in the image?
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    cats = image_patch.find("cat")
    dogs = image_patch.find("dog")
    has_cats_or_dogs = len(cats) > 0 or len(dogs) > 0
    return bool_to_yesno(has_cats_or_dogs)
Tests:
1. Image Caption: "A grey tabby cat peacefully napping on a plush sofa" Answer: yes
2. Image Caption: "A lively golden retriever bounding across a grassy field in the park" Answer: yes
3. Image Caption: "Twin Siamese cats playfully swatting at a bright yellow ball" Answer: yes
4. Image Caption: "A cluster of wild horses trotting along the sandy shores of a sunlit beach" Answer: no
5. Image Caption: "An orange cat and a black Labrador playfully tugging on a rope toy" Answer: yes
6. Image Caption: "A modern living room featuring sleek furniture and devoid of any pets" Answer: no
# Query: Is there a red truck or bus in the image?
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    trucks = image_patch.find("truck")
    buses = image_patch.find("bus")
    red_trucks = [truck for truck in trucks if truck.verify_property("truck", "red")]
    red_buses = [bus for bus in buses if bus.verify_property("bus", "red")]
    has_red_trucks_or_buses = len(red_trucks) > 0 or len(red_buses) > 0
    return bool_to_yesno(has_red_trucks_or_buses)
Tests:
1. Image Caption: "A vibrant red Ford pickup parked beside a country road" Answer: yes
2. Image Caption: "A red double-decker bus navigating through a busy downtown street" Answer: yes
3. Image Caption: "A large blue semi-truck cruising down an interstate highway" Answer: no
4. Image Caption: "A quiet suburban street devoid of any large vehicles like buses or trucks" Answer: no
5. Image Caption: "A shiny red Ferrari speeding on a professional race track" Answer: no
6. Image Caption: "An array of red delivery trucks lined up in a distribution center parking lot" Answer: yes
7. Image Caption: "Several bright yellow school buses parked in a row at a local school" Answer: no
# Query: What color is the largest car in the image?
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    car_patches = image_patch.find("car")
    if not car_patches:
        return "No cars found in the image."
    # Sort cars by their area to find the largest one
    car_patches.sort(key=lambda x: x.area, reverse=True)
    largest_car_patch = car_patches[0]
    color_of_largest_car = largest_car_patch.simple_query("What is the color?")
    return color_of_largest_car
Tests:
1. Image Caption: "A large blue Ford pickup truck driving on a busy highway" Answer: blue
2. Image Caption: "A city street empty of any large vehicles like buses or trucks" Answer: no answer
3. Image Caption: "A row of green food trucks serving lunch in an urban park" Answer: green
4. Image Caption: "A scene with a green public bus next to a smaller blue pickup at an intersection" Answer: green
# Query: Is the vase to the left or right of the center?
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    vase_patches = image_patch.find("vase")
    if not vase_patches:
        return "No vases found in the image."
    vase_patch = vase_patches[0]
    vase_position = vase_patch.horizontal_center
    image_center = (image_patch.left + image_patch.right) / 2
    if vase_position < image_center:
        return "left"
    elif vase_position > image_center:
        return "right"
    else:
        return "center"
Tests:
1. Image Caption: "A delicate porcelain vase positioned on the right end of a mahogany dining table" Answer: right
2. Image Caption: "A tall glass vase sitting on the left side of a neatly made bed in a sunlit room" Answer: left
3. Image Caption: "A ceramic vase centrally placed on a round table surrounded by chairs" Answer: center
# Query: What is the highest object in the image?
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    possible_objects = ["car", "tree", "building", "person", "vase", "animal", "vehicle", "furniture"]
    all_patches = []
    for obj in possible_objects:
        all_patches.extend(image_patch.find(obj))
    if not all_patches:
        return "No objects found in the image."
    highest_patch = max(all_patches, key=lambda x: x.upper)
    highest_object_name = highest_patch.simple_query("What is this?")
    return highest_object_name
Tests:
1. Image Caption: "A massive skyscraper dominating the skyline among lower city buildings" Answer: skyscraper
2. Image Caption: "A lone oak tree surpassing the height of the cottage it stands next to" Answer: tree
3. Image Caption: "Colorful balloons drifting above the treetops in a clear sky" Answer: balloons
4. Image Caption: "A commercial jet flying high above the city’s tallest skyscrapers" Answer: plane
5. Image Caption: "A majestic eagle soaring high above a vast canyon landscape" Answer: eagle
6. Image Caption: "A figure standing on the peak of a grassy hill under a blue sky" Answer: person
Create test cases for the specified query and program using the format provided in the examples.
The test cases should consist of image captions and answers to the query.
The answers should be concise, limited to a single word.
Query: INSERT_QUERY_HERE
Program:
INSERT_PROGRAM_HERE
Tests:
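The program-conditioned variant above has two placeholders (INSERT_QUERY_HERE and INSERT_PROGRAM_HERE), and its in-context programs call the ImagePatch API together with a bool_to_yesno helper. The sketch below is only an illustration under the assumption of a ViperGPT-style helper; the function names here are not the paper's implementation.

def bool_to_yesno(flag: bool) -> str:
    # Assumed ViperGPT-style helper used by the in-context programs above.
    return "yes" if flag else "no"

def build_test_generation_prompt(template: str, query: str, program: str) -> str:
    # Fill both placeholders of the program-conditioned template.
    return (template
            .replace("INSERT_QUERY_HERE", query)
            .replace("INSERT_PROGRAM_HERE", program))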
Listing 9: Image Generation Layout Prompt
I will provide you with a caption for a photo, image, or painting.
Your task is to generate the bounding boxes for the objects mentioned in the caption, along with a background prompt describing the scene.
The images are of size 512x512. The top-left corner has coordinate [0, 0].
The bottom-right corner has coordinate [512, 512].
The bounding boxes should not overlap or go beyond the image boundaries.
Each bounding box should be in the format of (object name, [top-left x coordinate, top-left y coordinate, box width, box height]) and should not include more than one object.
Do not put objects that are already provided in the bounding boxes into the background prompt. Do not include non-existing or excluded objects in the background prompt.
Use "A realistic scene" as the background prompt if no background is given in the prompt. If needed, you can make reasonable guesses.
Please refer to the example below for the desired format.
Caption: A realistic image of landscape scene depicting a green car parking on the left of a blue truck, with a red air balloon and a bird in the sky
Objects: [('a green car', [21, 281, 211, 159]), ('a blue truck', [269, 283, 209, 160]), ('a red air balloon', [66, 8, 145, 135]), ('a bird', [296, 42, 143, 100])]
Background prompt: A realistic landscape scene
Negative prompt: None
Caption: A realistic top-down view of a wooden table with two apples on it
Objects: [('a wooden table', [20, 148, 472, 216]), ('an apple', [150, 226, 100, 100]), ('an apple', [280, 226, 100, 100])]
Background prompt: A realistic top-down view
Negative prompt: None
Caption: A realistic scene of three skiers standing in a line on the snow near a palm tree
Objects: [('a skier', [5, 152, 139, 168]), ('a skier', [278, 192, 121, 158]), ('a skier', [148, 173, 124, 155]), ('a palm tree', [404, 105, 103, 251])]
Background prompt: A realistic outdoor scene with snow
Negative prompt: None
Caption: An oil painting of a pink dolphin jumping on the left of a steam boat on the sea
Objects: [('a steam boat', [232, 225, 257, 149]), ('a jumping pink dolphin', [21, 249, 189, 123])]
Background prompt: An oil painting of the sea
Negative prompt: None
Caption: A cute cat and an angry dog without birds
Objects: [('a cute cat', [51, 67, 271, 324]), ('an angry dog', [302, 119, 211, 228])]
Background prompt: A realistic scene
Negative prompt: birds
Caption: Two pandas in a forest without flowers
Objects: [('a panda', [30, 171, 212, 226]), ('a panda', [264, 173, 222, 221])]
Background prompt: A forest
Negative prompt: flowers
Caption: An oil painting of a living room scene without chairs with a painting mounted on the wall, a cabinet below the painting, and two flower vases on the cabinet
Objects: [('a painting', [88, 85, 335, 203]), ('a cabinet', [57, 308, 404, 201]), ('a flower vase', [166, 222, 92, 108]), ('a flower vase', [328, 222, 92, 108])]
Background prompt: An oil painting of a living room scene
Negative prompt: chairs
Caption: INSERT_PROMPT_HERE
Objects:
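A completion produced by this layout prompt can be parsed into a structured layout for a layout-conditioned image generator. The parser below is an illustrative sketch under the assumption that the model reproduces the labeled "Objects:", "Background prompt:", and "Negative prompt:" lines of the in-context examples; the function name and the use of ast.literal_eval are not part of the original pipeline.

import ast

def parse_layout(completion: str):
    """Parse a layout completion into (objects, background_prompt, negative_prompt),
    where each object is (name, [x, y, width, height]) on a 512x512 canvas."""
    objects, background, negative = [], "A realistic scene", None
    for raw_line in completion.splitlines():
        line = raw_line.strip()
        if line.startswith("Objects:"):
            objects = ast.literal_eval(line[len("Objects:"):].strip())
        elif line.startswith("Background prompt:"):
            background = line[len("Background prompt:"):].strip()
        elif line.startswith("Negative prompt:"):
            value = line[len("Negative prompt:"):].strip()
            negative = None if value == "None" else value
    return objects, background, negative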
Listing 10: Reprompting with Errors VQA
INSERT_IMAGE_PATCH_API
You are provided with a Python program that answers a query about an image, together with a set of tests, their corresponding outputs, and the expected responses.
Correct the Python program such that it passes the tests.
- Ensure the corrected program is different from the incorrect program provided.
Query: Is there a blue chair in the image?
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    blue_chair = image_patch.find("chair")
    if not blue_chair:
        return "No"
    is_blue = any([chair.verify_property("blue") for chair in blue_chair])
    return "Yes" if is_blue else "No"
Error: verify_property() missing 1 required positional argument: 'visual_property'
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    chair_patches = image_patch.find("chair")
    if not chair_patches:
        return "No"  # No chairs found
    blue_chair_found = any(chair.verify_property("chair", "blue") for chair in chair_patches)
    return "Yes" if blue_chair_found else "No"
Query: "Are there any flowers to the left of the house?"
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    house_patches = image_patch.find("house")
    left_of_house_patch = image_patch.crop_left_of_bbox(
        house_patches.left, house_patches.lower, house_patches.right, house_patches.upper
    )  # Incorrect attribute access
    return "Yes" if left_of_house_patch.exists("flower") else "No"
Error: list object has no attribute left
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    house_patches = image_patch.find("house")
    if not house_patches:
        return "No house found"
    house_patch = house_patches[0]
    left_of_house_patch = image_patch.crop_left_of_bbox(
        house_patch.left, house_patch.lower, house_patch.right, house_patch.upper
    )
    flowers_found = left_of_house_patch.find("flower")
    return "Yes" if flowers_found else "No"
Query: Who wears a green shirt?
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    # Incorrectly calling find() with an extra argument, leading to an error
    people_patches = image_patch.find("person", "green")
    if not people_patches:
        return "No one"
    people_wearing_green_shirts = []
    for person_patch in people_patches:
        if person_patch.verify_property("clothing", "shirt") and person_patch.verify_property("color", "green"):
            people_wearing_green_shirts.append(person_patch)
    if not people_wearing_green_shirts:
        return "No one"
    wearing_green_shirts = ', '.join([person.simple_query("Who is this?") for person in people_wearing_green_shirts])
    return wearing_green_shirts
Error: find() takes 2 positional arguments but 3 were given
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    people_patches = image_patch.find("person")
    if not people_patches:
        return "No people found"
    people_wearing_green_shirts = []
    for index, person_patch in enumerate(people_patches):
        if person_patch.verify_property("clothing", "shirt") and person_patch.verify_property("color", "green"):
            people_wearing_green_shirts.append(index)
    if not people_wearing_green_shirts:
        return "No one"
    wearing_green_shirts = ', '.join([people_patches[i].simple_query("Who is this?") for i in people_wearing_green_shirts])
    return wearing_green_shirts
Query: "Is the blue ball inside the box?"
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    ball_patches = image_patch.find("ball")
    blue_ball = [ball for ball in ball_patches if ball.verify_property("color", "blue")]
    blue_ball_left = blue_ball[0].left
    box_patches = image_patch.find("box")
    box_left = box_patches[0].left  # Assuming there's always a box present
    if not box_patches:
        return "No"
    return "Yes"
Error: IndexError: list index out of range
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    ball_patches = image_patch.find("ball")
    if not ball_patches:
        return "No"  # No ball found
    blue_ball = [ball for ball in ball_patches if ball.verify_property("color", "blue")]
    if not blue_ball:
        return "No"  # No blue ball found
    box_patches = image_patch.find("box")
    if not box_patches:
        return "No"  # No box found
    blue_ball_patch = blue_ball[0]
    for box_patch in box_patches:
        if (box_patch.left <= blue_ball_patch.left and
                box_patch.right >= blue_ball_patch.right and
                box_patch.upper <= blue_ball_patch.upper and
                box_patch.lower >= blue_ball_patch.lower):
            return "Yes"
    return "No"
Query: "Is the table bigger than the chair?"
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    table_patches = image_patch.find("table")
    chair_patches = image_patch.find("chair")
    if not table_patches or not chair_patches:
        return "No"
    if table_patch.area < chair_patch.area:
        return "Yes"
    return "No"
Error: name 'table_patch' is not defined
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    table_patches = image_patch.find("table")
    chair_patches = image_patch.find("chair")
    if not table_patches or not chair_patches:
        return "No"
    table_patch = table_patches[0]
    chair_patch = chair_patches[0]
    if table_patch.area > chair_patch.area:
        return "Yes"
    return "No"
Query: "What is the color of the largest ball?"
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    ball_patches = image_patch.find("ball")[0]
    ball_patches.sort(key=lambda x: x.area)
    largest_ball = ball_patches[-1]  # Picks the smallest ball due to incorrect indexing
    return largest_ball.simple_query("What is the color?")
Error: 'ImagePatch' object has no attribute 'sort'
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    ball_patches = image_patch.find("ball")
    ball_patches.sort(key=lambda x: x.area)
    largest_ball = ball_patches[-1]
    return largest_ball.simple_query("What is the color?")
Query: INSERT_QUERY_HERE
Incorrect Program:
INSERT_CODE_HERE
Error: INSERT_ERROR_HERE
Corrected Program:
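The template above is applied when a candidate program raises an execution error: the query, the failing code, and the error message are substituted into the corresponding placeholders (INSERT_IMAGE_PATCH_API is assumed to be filled with the ImagePatch API specification beforehand), and the language model is asked for a corrected program. The loop below is a hedged sketch of that flow; llm and run_program are hypothetical callables supplied by the caller, not part of the released code.

def reprompt_on_error(llm, run_program, template: str, query: str, program: str, image):
    """Execute a candidate program; on failure, fill the reprompting template
    with the failing code and error message and request a corrected program."""
    try:
        return run_program(program, image), program
    except Exception as err:
        prompt = (template
                  .replace("INSERT_QUERY_HERE", query)
                  .replace("INSERT_CODE_HERE", program)
                  .replace("INSERT_ERROR_HERE", f"{type(err).__name__}: {err}"))
        corrected = llm(prompt)
        return run_program(corrected, image), corrected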
Listing 11: Reprompting with Errors ITM
INSERT_IMAGE_PATCH_API
You are provided with a Python program that answers a query about an image, together with a set of tests, their corresponding outputs, and the expected responses.
Correct the Python program such that it passes the tests.
- Ensure the corrected program is different from the incorrect program provided.
Query: "Verify image matches text="An airplane is flying in the sky, and birds are flying below it.""
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    airplane = image_patch.find("airplane")
    birds = image_patch.find("birds")
    if airplane[0].vertical_center > birds[0].vertical_center:
        return "Yes"
    return "No"
Error: IndexError: list index out of range
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    airplane_patches = image_patch.find("airplane")
    bird_patches = image_patch.find("bird")
    if not airplane_patches or not bird_patches:
        return "No"
    airplane = airplane_patches[0]
    birds_below = all(bird.vertical_center > airplane.vertical_center for bird in bird_patches)
    return "Yes" if birds_below else "No"
Query: "Verify image matches text="The bird is flying above the tree, and a cat is sitting under the tree.""
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    tree = image_patch.find("tree")
    bird = image_patch.find("bird")
    cat = image_patch.find("cat")
    if not tree or not bird or not cat:
        return "No"
    if bird.vertical_center < tree.vertical_center and cat.vertical_center > tree.vertical_center:
        return "Yes"
    return "No"
Error: list has no attribute vertical_center
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    tree_patches = image_patch.find("tree")
    bird_patches = image_patch.find("bird")
    cat_patches = image_patch.find("cat")
    if not tree_patches or not bird_patches or not cat_patches:
        return "No"
    tree = tree_patches[0]
    bird_above = all(bird.vertical_center < tree.vertical_center for bird in bird_patches)
    cat_below = all(cat.vertical_center > tree.vertical_center for cat in cat_patches)
    return "Yes" if bird_above and cat_below else "No"
Query: "Verify image matches text="A man is riding a bicycle, and a dog is running beside him.""
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    man = image_patch.find("man")
    bicycle = image_patch.find("bicycle")
    dog = image_patch.find("dog")
    if not man or not bicycle or not dog:
        return "No"
    if abs(man[0].center_x - dog[0].center_x) < 50:
        return "Yes"
    return "No"
Error: ImagePatch has no attribute center_x
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    man_patches = image_patch.find("man")
    bicycle_patches = image_patch.find("bicycle")
    dog_patches = image_patch.find("dog")
    if not man_patches or not bicycle_patches or not dog_patches:
        return "No"
    man = man_patches[0]
    bicycle = bicycle_patches[0]
    dog_beside = any(abs(dog.horizontal_center - man.horizontal_center) < 100 for dog in dog_patches)
    return "Yes" if dog_beside else "No"
Query: "Verify image matches text="A man is holding a red balloon, and a child is reaching up to grab it.""
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    man = image_patch.find("man")
    balloon = image_patch.find("balloon")
    child = image_patch.find("child")
    if not man or not balloon or not child:
        return "No"
    if balloon[0].verify_property("red") and child[0].vertical_center < balloon[0].vertical_center:
        return "Yes"
    return "No"
Error: verify_property() missing 1 required positional argument: 'visual_property'
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    man_patches = image_patch.find("man")
    balloon_patches = image_patch.find("balloon")
    child_patches = image_patch.find("child")
    if not man_patches or not balloon_patches or not child_patches:
        return "No"
    balloon = balloon_patches[0]
    is_red_balloon = balloon.verify_property("balloon", "red")
    child_below_balloon = all(child.vertical_center < balloon.vertical_center for child in child_patches)
    return "Yes" if is_red_balloon and child_below_balloon else "No"
Query: "Verify image matches text="A cat is sitting on the table, and a book is lying beside it.""
Incorrect Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    cat_patches = image_patch.find("cat")
    book_patches = image_patch.find("book")
    if not cat_patches or not book_patches:
        return "No"
    if abs(cat.horizontal_center - book.horizontal_center) < 50:
        return "Yes"
    return "No"
Error: name 'cat' is not defined
Corrected Program:
def execute_command(image):
    image_patch = ImagePatch(image)
    cat_patches = image_patch.find("cat")
    book_patches = image_patch.find("book")
    table_patches = image_patch.find("table")
    if not cat_patches or not book_patches or not table_patches:
        return "No"
    cat = cat_patches[0]
    book = book_patches[0]
    table = table_patches[0]
    is_cat_on_table = cat.vertical_center < table.vertical_center and abs(cat.horizontal_center - table.horizontal_center) < 50
    is_book_beside_cat = abs(book.horizontal_center - cat.horizontal_center) < 50
    return "Yes" if is_cat_on_table and is_book_beside_cat else "No"
Query: INSERT_QUERY_HERE
Incorrect Program:
INSERT_CODE_HERE
Error: INSERT_ERROR_HERE
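For image-text matching, the queries in this listing embed the candidate caption in a text= clause, as shown in the in-context examples above. A small illustrative formatter for producing those query strings (the function name is hypothetical):

def itm_query(text: str) -> str:
    # Format an image-text-matching instance as used in the ITM prompts above.
    return f'Verify image matches text="{text}"'

# e.g., itm_query("A man is riding a bicycle, and a dog is running beside him.")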