
Gazing at Rewards: Eye Movements as a Lens into
Human and AI Decision-Making in Hybrid Visual Foraging

Bo Wang1,2,3    Dingwei Tan1,2,4    Yen-Ling Kuo5    Zhaowei Sun3    Jeremy M. Wolfe6,7    Tat-Jen Cham1    Mengmi Zhang1,2 Corresponding author   
1College of Computing and Data Science, Nanyang Technological University, Singapore
2Deep NeuroCognition Lab, I2R and CFAR, Agency for Science, Technology and Research, Singapore
3Harbin Institute of Technology, Harbin, China
4Beijing Institute of Technology, Beijing, China
5University of Virginia, USA
6Brigham and Women’s Hospital, USA
7Harvard Medical School, USA
Address correspondence to [email protected]
Abstract

Imagine searching a collection of coins for quarters ($0.25), dimes ($0.10), nickels ($0.05), and pennies ($0.01)—a hybrid foraging task where observers look for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, revealing that humans are proficient reward foragers. Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe these decision-making processes of humans, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the images using foveated vision, and produces a sequence of eye movements along with decisions on whether to collect each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and approximates human foraging behavior in eye movements and foraging biases within time-limited environments. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying set sizes demonstrate the VF model’s effective generalization. Our work offers valuable insights into the relationship between eye movements and decision-making, with our model serving as a powerful tool for further exploration of this connection. All data, code, and models will be made publicly available.

1 Introduction

Figure 1: Illustrative example of eye movements and decision-making in a hybrid visual foraging task. The image depicts a real-world scenario where the goal is to search piles of coins for multiple instances of target coins with varying monetary values in order to maximize the cumulative monetary reward within a time-limited environment. Yellow dots and arrows represent the locations and order of eye movements during the search. Red bounding boxes show the target coins that are collected. Note that humans do not always collect every item they fixate on, highlighting the selective nature of the foraging process.

Hybrid visual foraging is a ubiquitous challenge in our daily life, such as grocery shopping for a list of items, simultaneously scanning for traffic lights, parking spaces, and restaurants while driving, or looking for a specific amount of change among piles of coins (Fig. 1). These tasks involve searching for multiple instances of various target types stored in memory, where target values and prevalence can vary, and the exact number of target instances is often unknown. This raises a critical question about how to prioritize target selections during the search process. Understanding these dynamics is essential for optimizing search efficiency and decision-making in complex environments.

To tackle this question, eye movements offer a unique window into the underlying perceptual, cognitive, and evaluative processes involved in decision-making, such as sensory evidence sampling and accumulation [137, 103, 84, 143, 75], decision timing and temporal expectation [11, 117, 107, 123, 6], response inhibition [82, 46, 62, 85, 24], and decision certainty and confidence [60, 107, 23, 10, 93], with high temporal and spatial resolution [116, 44, 72, 47, 100]. In hybrid visual foraging, while neuroscience and psychology studies [134, 136, 135, 77, 130] have primarily examined the sequence of target selections within the same environment and the timing of search transitions across environments, especially when target values and prevalence vary, there is a notable lack of work focusing on eye movements. Here, we design and conduct human psychophysics experiments to examine how foraging strategies and eye movements are influenced by the prevalence and value of targets.

Alongside studies in psychology and neuroscience, many AI models have been developed to predict eye movements during decision-making tasks, including visual search [57, 35, 2, 83, 142, 49, 125, 118], object recognition and detection [127, 8, 96, 88], and visual question answering [56, 18, 58]. Notably, existing visual search models integrate both bottom-up saliency [57, 35, 2] and top-down feature modulations [83, 142, 49]. However, these models assume idealized scenarios where either a single target type is present or multiple target types have equal values. As a result, they often overlook the need to prioritize target selections based on varying target prevalences and values during the search process. In this work, we introduce a computational model called Visual Forager (VF), a transformer-based architecture trained with reinforcement learning, designed to perform hybrid visual foraging efficiently across varying combinations of target prevalence and values. Unlike prior visual search models [26, 119, 19, 139], which often rely on human data for supervised training, our VF approximates human foraging behaviors and biases, despite zero training on human data. We highlight our key contributions:

1. Drawing from psychology, we introduce hybrid visual foraging tasks for AI models. The predicted eye movements offer a unique window into the decision-making process with high spatial and temporal resolution.

2. We propose an AI model, Visual Forager (VF), for hybrid visual foraging tasks. VF uses actor-critic networks with a vision transformer backbone and integrates feature-based and value-based modulations to guide decision-making processes, determining where to fixate next and whether to collect currently fixated items during foraging tasks.

3. To benchmark AI model performances, we design and conduct human eye-tracking experiments for hybrid visual foraging tasks. Despite no training on human data, our VF achieves cumulative rewards comparable to human participants and approximates their foraging behaviors, including eye movements and decision biases toward highly valued and prevalent targets.

4. Humans can flexibly adapt their foraging strategies to maximize total rewards under varying target values and prevalence. Remarkably, our VF also performs efficient foraging under out-of-distribution conditions it was never trained on. This capability is attributed to our newly introduced data augmentations applied to target values.

2 Related Work

Eye movements as a window into decision making. Many psychology and neuroscience studies have used eye movements as a unique, non-invasive means [76, 114, 92, 59, 66, 113, 39, 37, 72, 73, 40, 98, 102, 101] to reflect how efficiently we interpret decision instructions [68, 137], the timing and duration of decision formation [11, 117, 107, 123, 6], expected rewards [128, 140, 41, 109, 122], decision accuracy [107], and our confidence in the outcome [60, 23, 10, 93]. Although multiple studies analyze decision making in hybrid visual foraging tasks [97, 17, 15, 141, 34], there is a lack of research specifically examining eye movements and their relationship to decision making. To close this gap, we design and conduct eye-tracking experiments to investigate eye movements and their interactions with foraging behaviors.

Computational models for goal-directed visual search. Eye movements in visual search are typically guided by five primary factors [133]: bottom-up saliency [63, 121, 132], top-down feature guidance [33, 81, 29], scene properties [12, 54, 120], prior history [129, 31, 67], and item values [4, 79, 3]. A range of computational models have been proposed to model human eye movements in each of these aspects during visual search [42, 125, 57, 74, 126, 21, 22, 13, 71, 124, 115, 16, 91, 65, 64, 45, 2, 83, 142, 49]. Recent deep-learning models, such as [90, 138, 48, 9, 142, 49, 143, 30], have demonstrated success in searching for a single target in complex, naturalistic scenes. However, these models fail to look for multiple instances of multiple target types based on varying values and prevalence. To close this gap, we introduce Visual Forager (VF), capable of searching for multiple instances across different target types. Moreover, while models like GazeFormer and IRL rely on human data for training and struggle to generalize to unseen targets or unseen combinations of values and prevalence in out-of-distribution scenarios, VF can flexibly adapt its foraging strategy under these conditions.

Deep reinforcement learning (RL) for value-guided decision-making. Deep RL models [89, 87, 50, 106, 1, 36, 99] have been applied to a wide range of decision-making tasks, including video game playing [86, 110, 78], robotic control [5, 104, 53], and resource management [80, 20, 144]. These models learn to make sequential decisions by interacting with environments and receiving feedback through immediate or long-term reward [25, 7, 69, 28, 38]. Despite advancements in these decision-making models, they have not been specifically designed to model eye movements in hybrid foraging tasks. Here, we introduce a transformer-based network to learn optimal foraging and eye movement policies. This is in contrast to [52, 96, 108], where a transformer-based architecture is only used to extract visual features prior to decision-making.

Figure 2: Schematic of the hybrid visual foraging experiment. Each foraging trial starts with a 2-second center fixation (omitted here for simplicity), followed by the presentation of target images and their associated values (e.g., a plant valued at 4). To ensure human participants memorize the targets and their values, they must pass a recognition test by selecting all targets among distractors and correctly matching their values. If they make errors, they repeat the target and value presentation phases. After another 2-second center fixation presentation, an object array is displayed. Both human and AI agents are tasked with collecting as many targets as possible through mouse clicks to maximize their total rewards, where rewards correspond to the values of the target objects, and a penalty of -1 is incurred for clicking on distractors. The trial ends either after 30 seconds or when 20 clicks are made.

3 Hybrid visual foraging

3.1 Human psychophysics experiments

We conducted an in-lab psychophysics experiment for hybrid visual foraging, schematically illustrated in Fig. 2. The items were randomly arranged on a 16×16 grid, containing either 90, 105, or 120 items, with 20% to 30% of the squares filled with target instances. In each foraging trial, subjects searched for $N\in\{1,2,4\}$ target objects, each with a varying number of target instances. Targets and distractors in the hybrid foraging search arrays were randomly selected from a pool of 2,400 unique items in [14]. A total of 15 subjects were recruited, yielding 750 trials, containing 50,514 eye fixations and 12,851 mouse clicks. All experiments were conducted with the subjects’ informed consent and according to protocols approved by the Institutional Review Board of our institution. See the Appendix for more details on the human psychophysics experiments.

3.2 Foraging environments for AI models

Humans have accumulated years of visual experiences, making them efficient zero-shot visual searchers without requiring prior training for hybrid visual foraging [135]. In contrast, AI models require task-specific training. Hence, we introduced procedural generation of diverse foraging environments, subsequently used to train AI models. As demonstrated in the human psychophysics experiments described above, the procedural generation of a foraging trial depends on several key experimental parameters: the total number of items on the object arrays within a 16×16-sized grid, the number of target objects on the search arrays, the prevalence of target instances for each target object, the values assigned to the target objects, and the selection of target and distractor items from a pool of 2,400 unique items. It is evident that the decision-making processes of both humans and AI models may be influenced by any of these experimental parameters. We sub-sample various combinations of these experimental parameters for the procedural generation of foraging environments: the total number of items on the search array is fixed at 105. A fixed set of 4 items is randomly selected as targets with their values set at 2, 4, 8, and 16 and their prevalence randomly determined. See Appendix Sec. S1.2 for details.
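
As an illustration of the procedural generation described above, the sketch below assembles one foraging trial: a fixed set of 4 targets with values 2, 4, 8, and 16, randomly determined prevalence, and 105 items scattered over a 16×16 grid. The sampling details (e.g., the 20–30% target ratio and the Dirichlet split over target types) are simplifying assumptions for illustration; the exact procedure is in Appendix Sec. S1.2.

```python
import numpy as np

GRID = 16                      # 16x16 search grid (Sec. 3.1)
N_ITEMS = 105                  # total items per array in the training setting
TARGET_VALUES = [2, 4, 8, 16]  # values assigned to the 4 target types
POOL_SIZE = 2400               # unique object images in the item pool

def make_trial(rng=None):
    """Procedurally generate one foraging trial (illustrative sketch only)."""
    if rng is None:
        rng = np.random.default_rng()

    # assume roughly 20-30% of the placed items are target instances
    n_target_items = int(N_ITEMS * rng.uniform(0.2, 0.3))
    # randomly split target instances across the 4 target types (prevalence)
    prevalence = rng.multinomial(n_target_items, rng.dirichlet(np.ones(4)))
    target_ids = rng.choice(POOL_SIZE, size=4, replace=False)

    items = []  # (item_id, value) for every placed item
    for tid, value, count in zip(target_ids, TARGET_VALUES, prevalence):
        items += [(tid, value)] * count
    distractor_ids = rng.choice(np.setdiff1d(np.arange(POOL_SIZE), target_ids),
                                size=N_ITEMS - n_target_items, replace=True)
    items += [(did, -1) for did in distractor_ids]  # distractors carry the -1 click penalty

    # scatter all items over randomly chosen cells of the 16x16 grid
    cells = rng.choice(GRID * GRID, size=N_ITEMS, replace=False)
    grid = np.full((GRID * GRID, 2), np.nan)
    grid[cells] = np.asarray(items, dtype=float)
    return grid.reshape(GRID, GRID, 2), dict(zip(target_ids.tolist(), TARGET_VALUES))
```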

To benchmark AI model performance in hybrid foraging tasks, we introduce two in-domain hybrid foraging conditions that align with the distribution of the training environments that the AI models were optimized to solve. (1) In-domain Uneven Value, Equal Prevalence (UnValEqPre): the prevalence of all 4 targets is equal but their values vary. (2) In-domain Uneven Value, Unequal Prevalence (UnValUnPre): both target prevalence and values vary, with low-value targets being more common and high-value targets scarcer. To assess whether the AI models can generalize to out-of-distribution (OOD) hybrid visual search tasks, where experimental parameters differ from those encountered during training, we introduce five OOD conditions. (1) OOD - Even Value, Unequal Prevalence (EqValUnPre): the 4 target types have equal values but varying prevalence. (2) OOD - Unseen target objects (UTargets): the target images at test time were never seen during training. (3) OOD - Unseen value combinations (UValues): the target values exceed the range used for training, varying in either arithmetic or geometric series. (4) OOD - Unseen total item numbers (UItemNum): the total number of items on the search grid differs from the one used for training. (5) OOD - Unseen target set sizes (USetSize): the set size of target objects is manipulated. See Appendix Sec. S1.3 for details of each condition.

Comparisons of foraging environments between humans and AIs. For a fair comparison between human and AI performance, we categorized the foraging trials from human psychophysics experiments into the same seven conditions above used for testing the AI models. We apply identical rewards and penalties to humans and all AI models for training and evaluation. Moreover, to interact with the foraging environments, both humans and AI models must decide where to move their eye fixations next and whether to click on the current item. In human psychophysics experiments, each trial ends either after 30 seconds or once 20 clicks are made. Unlike humans, AI models do not experience time delays in eye movements or motor responses for mouse clicks. To simulate these time constraints and allow for a fair comparison between the fixation and click sequences of AI models and humans, we impose constant time costs on the AI models: 776 milliseconds per click and 336 milliseconds per fixation. These time costs are estimated through linear regression on human response data (Appendix Fig. S2).
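
The per-action time costs can be charged against the same 30-second/20-click budget that humans face; a minimal bookkeeping sketch (constants from above, function name ours) is:

```python
# Illustrative time-budget bookkeeping for AI agents (constants from Sec. 3.2;
# the function and variable names are ours, not the released code).
FIX_MS, CLICK_MS = 336.0, 776.0
TIME_LIMIT_MS, CLICK_LIMIT = 30_000.0, 20

def episode_over(n_fixations: int, n_clicks: int) -> bool:
    """Terminate when the simulated clock or the click budget runs out."""
    elapsed = n_fixations * FIX_MS + n_clicks * CLICK_MS
    return elapsed >= TIME_LIMIT_MS or n_clicks >= CLICK_LIMIT
```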

4 Our proposed Visual Forager (VF)

Figure 3: Architecture overview of our Visual Forager. VF consists of three modules elaborated in Sec. 4: visual feature modulation from target images with foveated vision mimicking eccentricity-dependent sampling in human vision (Sec. 4.1), modulation from various values of different targets (Sec. 4.2), and a decision-making process with an actor-critic transformer architecture, outputting next fixation locations from predicted attention maps and the probability of clicking the currently fixated item (Sec. 4.3).

We propose the Visual Forager (VF), a computational model for hybrid visual foraging (Fig. 3). At the current $t$-th fixation location $F_t$, the model takes as input the search image $I_S$ and $N$ target images $I_T^{1:N}$ with their corresponding target values $V^{1:N}$, processes them with foveated vision, and modulates the decision making with visual features of target images and their corresponding values. Our VF outputs the predicted $(t+1)$-th fixation location $F_{t+1}$ and the mouse-click policies for the currently fixated item. VF comprises three modules: target feature modulation, target value modulation, and decision-making.

4.1 Target feature modulation

Our VF processes the $N$ input target images $I_T^{1:N}$ with a feed-forward convolutional neural network (2D-CNN) to extract their feature maps $\phi_T^{1:N}$ at the last convolution layer. We used the VGG16 network [111], pre-trained on ImageNet [27], as the backbone of the feature extractor and froze its weights during training on hybrid foraging tasks.

Receptive field sizes in the visual cortex increase progressively across brain areas [43]. This scaling is modeled in current visual recognition systems through eccentricity-dependent pooling operations [49]. Unlike uniform sampling in standard max pooling, the receptive field size in eccentricity-dependent pooling layers grows with increasing distance from the current fixation $F_t$. Following [49], we replace VGG16’s standard max pooling layers $l_{10}$, $l_{14}$, and $l_{18}$ with eccentricity-dependent pooling [49]. This modified feature extractor produces feature maps $\phi_{S,t}$ for the search image $I_S$ from the last convolutional layer, given the fixation location $F_t$. To simplify notation, we omit the subscript $t$ in $\phi_{S,t}$ for the rest of the text, unless specified. We use pre-trained ImageNet weights for VGG16 and freeze them for semantic feature extraction. Since pooling layers lack learnable weights, freezing the pre-trained weights of the eccentricity-dependent VGG16 does not affect feature extraction quality. Note that we do not use the eccentricity-dependent VGG16 to process the input target images $I_T^{1:N}$, as these target images are often small and can be processed at high resolution within the foveated region.
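
A highly simplified sketch of how an eccentricity-dependent pooling layer could be realized is shown below. It only conveys the core idea that the pooling window grows with distance from the current fixation; the actual operator of [49] differs in its exact scaling and implementation.

```python
import torch

def eccentricity_pool(feat, fix_rc, base=1, gain=0.5):
    """Simplified eccentricity-dependent max pooling (sketch, not the exact
    operator of [49]): the pooling window at each output location grows
    linearly with its distance from the current fixation `fix_rc`.

    feat: (C, H, W) feature map; fix_rc: (row, col) in feature-map coordinates.
    Returns a (C, H//2, W//2) map, mimicking a stride-2 pooling layer.
    """
    C, H, W = feat.shape
    out = torch.empty(C, H // 2, W // 2)
    for i in range(H // 2):
        for j in range(W // 2):
            r, c = 2 * i, 2 * j
            ecc = ((r - fix_rc[0]) ** 2 + (c - fix_rc[1]) ** 2) ** 0.5
            k = base + int(gain * ecc)          # window half-size grows with eccentricity
            r0, r1 = max(0, r - k), min(H, r + k + 1)
            c0, c1 = max(0, c - k), min(W, c + k + 1)
            out[:, i, j] = feat[:, r0:r1, c0:c1].amax(dim=(-2, -1))
    return out
```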

Visual search or feature-based attention is often modulated by the visual features of the targets [142, 49, 143]. Similar to [142], we implement the attentional modulation by computing the similarity between the features $\phi_T^i$ of the $i$-th target image $I_T^i$ and the features $\phi_S$ of the search image $I_S$: $M_F^i=\mathcal{M}(\phi_T^i,\phi_S)$, where $\mathcal{M}(\cdot)$ is the 2D convolution of stride 2 with $\phi_T^i$ serving as a convolution kernel applied to $\phi_S$. We repeat the same attentional modulation for each target image, resulting in $N$ similarity maps $M_F\in\mathbb{R}^{H\times W\times N}$, where $H\times W=16\times 16$ denotes the grid size of the object arrays on $I_S$.
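
The attentional modulation amounts to using each target's feature map as a convolution kernel over the search-image features. A PyTorch sketch is below; the padding choice is our assumption so that the $N$ maps align with the 16×16 grid.

```python
import torch
import torch.nn.functional as F

def feature_modulation(phi_S, phi_T_list):
    """Compute target-similarity maps M_F (sketch of Sec. 4.1).

    phi_S: (1, C, Hs, Ws) search-image features from the eccentricity-
           dependent VGG16; phi_T_list: list of N (C, h, w) target features.
    Each target feature map serves as a stride-2 convolution kernel,
    yielding one similarity map per target.
    """
    maps = []
    for phi_T in phi_T_list:
        m = F.conv2d(phi_S, phi_T.unsqueeze(0), stride=2,
                     padding=(phi_T.shape[-2] // 2, phi_T.shape[-1] // 2))
        maps.append(m.squeeze(0).squeeze(0))          # (H, W)
    return torch.stack(maps, dim=-1)                  # (H, W, N) similarity maps M_F
```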

4.2 Target value modulation

Values have proven to be a strong modulator of guidance in visual search [3]. To incorporate target values in our VF, we introduce a value encoder $\mathcal{E}_V(\cdot)$ consisting of two fully connected layers and a ReLU activation layer in between. $\mathcal{E}_V$ takes $N$ target values $V^{1:N}\in\mathbb{R}^{N}$ as input and produces a $D$-dimensional value embedding $\mathcal{E}_V(V^{1:N})$ that encodes the target values. To embed $\mathcal{E}_V(V^{1:N})$ into $M_F$, we also introduce a 2D-CNN encoder $\mathcal{E}_F(\cdot)$ to extract feature maps $\mathcal{E}_F(M_F)$ of size $H\times W\times D$ from the similarity maps $M_F$. The encoder comprises convolution blocks with ReLU activations, using $1\times 1$ convolution kernels with stride 1 to maintain the spatial resolution of $M_F$. No pooling layers are used. These convolution blocks facilitate feature fusion across all $N$ similarity maps within each location but not across $H\times W$ locations.

Similar to the way that positional embeddings are added to each feature vector of the image patches in the vision transformer (ViT) [32], we duplicate the value embedding over all $H\times W$ locations and perform element-wise summation $\oplus$ on $\mathcal{E}_F(M_F)$:

$M_V=\mathcal{E}_F(M_F)\oplus\text{duplicate}(\mathcal{E}_V(V^{1:N}))$   (1)

This results in the value-modulated feature maps $M_V\in\mathbb{R}^{H\times W\times D}$, allowing the model to adapt the target similarity features based on target values. Note that the value-modulated feature maps $M_V$ are fixation-dependent, since $\phi_S$ changes with the current fixation $F_t$. For notational simplicity, we omit the subscript $t$ for $M_V$.
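
Putting Sec. 4.2 together, a minimal sketch of the value encoder $\mathcal{E}_V$, the $1\times 1$-convolution encoder $\mathcal{E}_F$, and the broadcast summation of Eq. 1 could look as follows; layer widths are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class ValueModulation(nn.Module):
    """Sketch of Sec. 4.2: fuse similarity maps M_F (H, W, N) with target
    values V^{1:N} into value-modulated features M_V (H, W, D)."""
    def __init__(self, n_targets=4, d_model=64):
        super().__init__()
        # value encoder E_V: two fully connected layers with a ReLU in between
        self.value_enc = nn.Sequential(
            nn.Linear(n_targets, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # feature encoder E_F: 1x1 convolutions, stride 1, no pooling
        self.feat_enc = nn.Sequential(
            nn.Conv2d(n_targets, d_model, kernel_size=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=1))

    def forward(self, M_F, values):
        # M_F: (H, W, N) similarity maps; values: (N,) target values
        x = self.feat_enc(M_F.permute(2, 0, 1).unsqueeze(0))   # (1, D, H, W)
        v = self.value_enc(values)                              # (D,)
        # duplicate the value embedding over all H x W locations (Eq. 1)
        M_V = x + v.view(1, -1, 1, 1)
        return M_V.squeeze(0).permute(1, 2, 0)                  # (H, W, D)
```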

4.3 Policy network for decision making

Given $M_V$ at the current fixation $F_t$ as the state $s$ of the environment, VF must decide where to fixate next and whether to collect the currently fixated item. We model this as a Markov Decision Process with finite state and action spaces. Fixation locations are discretized to the $H\times W$ grid, corresponding to the search array size on $I_S$. The mouse click is a binary decision of whether to collect the fixated item. Here, we introduce the architecture of the decision-making module, also known as the policy network.

We employ the vision transformer (ViT) [32] as the backbone for our decision-making network with its weights randomly initialized. We reshape the value-modulated similarity feature maps $M_V$ into a sequence of patches $x_p\in\mathbb{R}^{P\times D}$, where $P=H\times W$. To retain positional information for all the patches, we add the standard learnable 1D positional embeddings to all the patches.

Similar to the role of the classification token in the original ViT, we introduce the extra click token to the patch sequence. A click action head is attached to the click token at the last layer of the transformer. The click action head is implemented with an average pooling layer over the $D$ dimensions of the click token embedding followed by a sigmoid function, outputting the probability $P_C$ of the click. A click action $a_c$ is sampled based on $P_C$.

In parallel, we take the embeddings of all the patches from $x_p$ in the last layer of the ViT, apply average pooling over the $D$ dimensions for every patch, and obtain the logit of fixation probability over the total $P$ locations. Next, we use the softmax operation to normalize the logit map and reshape it to a 2D probabilistic attention map $P_F\in\mathbb{R}^{H\times W}$ indicating the most probable fixation location at $t+1$. The next fixation action $a_f=F_{t+1}$ is sampled based on $P_F$.
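
A sketch of the two action heads operating on the last-layer ViT tokens is given below; the token ordering (click token first) and the tensor shapes are our assumptions.

```python
import torch

def decode_actions(tokens, H=16, W=16):
    """Sketch of the Sec. 4.3 action heads. `tokens` are the last-layer ViT
    embeddings, shape (P + 1, D): index 0 is the extra click token, the
    remaining P = H*W are the patch tokens of M_V."""
    click_tok, patch_toks = tokens[0], tokens[1:]
    P_C = torch.sigmoid(click_tok.mean())               # click probability
    logits = patch_toks.mean(dim=-1)                     # avg-pool over D per patch
    P_F = torch.softmax(logits, dim=0).view(H, W)        # attention map over next fixations
    a_c = torch.bernoulli(P_C)                           # sample click action
    a_f = torch.multinomial(P_F.flatten(), 1)            # sample next fixation index
    return P_C, P_F, a_c, a_f
```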

4.4 Training Details

To reduce overfitting and enhance the generalization of our VF, during training, we implement a channel augmentation technique that shuffles the input order of target images and their corresponding value pairs. We train VF with reinforcement learning (RL). The goal of RL is to jointly learn the fixation policy $\pi_f(\cdot|s)$ and the click policy $\pi_c(\cdot|s)$, aiming to maximize the expected cumulative reward. Here, our VF uses the stochastic policy, where $\pi_f(a_f|s)=P_F$ and $\pi_c(a_c|s)=P_C$. With these two separate policies, we can combine them into a joint probability over a multi-discrete action space: $P_A=P_C P_F=\pi(a_c,a_f|s)$.
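
For training, the two policies can be wrapped into one joint distribution over the multi-discrete action space; a sketch of the corresponding log-probability (as it might be fed to a PPO objective) is:

```python
import torch
from torch.distributions import Bernoulli, Categorical

def joint_log_prob(P_C, P_F, a_c, a_f):
    """Log-probability of the joint action (a_c, a_f) under the factored
    policy pi(a_c, a_f | s) = pi_c(a_c | s) * pi_f(a_f | s); a sketch of
    how the multi-discrete action space could be handled for PPO.
    """
    log_pc = Bernoulli(probs=P_C).log_prob(a_c.float())
    log_pf = Categorical(probs=P_F.flatten()).log_prob(a_f.squeeze())
    return log_pc + log_pf     # log pi(a_c, a_f | s)
```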

We specifically used the Proximal Policy Optimization (PPO) algorithm [106] with the Adam optimizer [61] for training $P_A$. PPO is a policy gradient method designed to enhance the stability and efficiency of policy updates. Preliminaries on RL and PPO, as well as PPO hyperparameters, can be found in Appendix Sec. S2.1.

In practice, PPO often employs an actor-critic network [106]. The actor network selects actions based on the current policy, while the critic network evaluates these actions by learning the value function, assessing the quality of the chosen actions. We use the policy network outlined in Sec. 4.3 as our actor network. To estimate the current state’s value (or state-value in RL), we add a special state-value token to the token sequence in the policy network. A critic head, implemented as a single fully connected layer, is attached to this token embedding at the final transformer layer to output the state-value $V_s$.

Two-stage training with curricula. Hybrid visual foraging remains challenging for decision-making models due to the diversity of conditions in search scenes. Training VF for this task requires a large number of interactions with the environments. To address this problem, similar to other curriculum learning works in RL [95, 94, 70], we introduce two-stage training with curricula, where transfer learning is applied to VF so that experience gained in an easy environment can be leveraged when starting to learn the next, harder task.

In the first training stage, we focus solely on training VF’s fixation policy $\pi_f(\cdot|s)$, omitting the click policy $\pi_c(\cdot|s)$. To encourage VF to fixate on high-value targets, we introduce intrinsic rewards aligned with the rewards for correct clicks: VF receives a reward equal to the target’s value for accurate fixations and a penalty of -1 for fixating on distractors. Additionally, a small penalty of -0.01 discourages fixation on blank areas. To further ease the task, we temporarily disable VF’s eccentricity-dependent vision, allowing it full-resolution access to the search image.
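
The stage-one intrinsic reward can be summarized in a few lines; the sketch below uses our own field names for the fixated grid cell.

```python
def stage1_fixation_reward(fixated_cell):
    """Intrinsic reward for the first training stage (sketch).
    `fixated_cell` is None for a blank square, otherwise a dict with
    'is_target' and 'value' fields (field names are ours).
    """
    if fixated_cell is None:
        return -0.01                  # small penalty for fixating blank areas
    if fixated_cell["is_target"]:
        return fixated_cell["value"]  # reward equals the target's value
    return -1.0                       # penalty for fixating a distractor
```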

Without eccentricity-dependent vision, $M_V$ becomes independent of the fixation $F_t$ and remains static across all fixations, resulting in a fixed policy for a given state. Thus, $M_V$ must be modified to enable exploration and prevent VF from repeatedly choosing the same location. Like many other visual search models [55, 112, 142], we introduce an infinite inhibition-of-return (IOR) mechanism, which zeros out all previously visited cells of $M_V$ on the $16\times 16$ grid.

We proceed with the original hybrid foraging setup in the second training stage. We transfer all the weights pre-trained in stage one and freeze them. Only the positional embeddings and the weights responsible for the click policy $\pi_c(\cdot|s)$ are fine-tuned. The same scoring system used for humans is applied to train our VF. Our VF processes $I_S$ with eccentricity-dependent vision.

In contrast to the infinite IOR in the first stage, we introduce a memory decay mechanism that approximates finite IOR, balancing the exploration of new locations and revisits of known ones. Although finite IOR reflects limitations in memory capacity, it can also enhance foraging by enabling strategic revisits in complex environments. For example, after exhausting high-value targets, returning to prior locations may help identify the next best options, thereby improving decision-making outcomes. Our VF employs a fixed finite IOR, independent of specific experimental conditions. The suppression of attention values at previously visited locations in $M_F$ decays over time. At the current fixation $F_t$, the suppression on a past location visited at $\tilde{t}\in\{1,2,\dots,t\}$ is given by $m_{\tilde{t},t}=\eta^{t-\tilde{t}}$, where $\eta=0.8$. This allows the inhibition of previously visited locations to gradually weaken, enabling VF to revisit those locations.
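
A sketch of this memory-decay IOR is shown below. Whether the decaying suppression $m_{\tilde{t},t}$ multiplies or subtracts the attention values in $M_F$ is not specified here, so the multiplicative form in the sketch is our assumption.

```python
import numpy as np

def apply_memory_decay_ior(M_F, fixation_history, t, eta=0.8):
    """Finite inhibition-of-return via memory decay (Sec. 4.4 sketch).

    M_F: (H, W, N) similarity maps at the current fixation t;
    fixation_history: list of (row, col, t_tilde) for past fixations.
    Suppression of a location visited at t_tilde is eta**(t - t_tilde),
    so inhibition fades over time and revisits become possible again.
    """
    out = M_F.copy()
    for r, c, t_tilde in fixation_history:
        out[r, c, :] *= 1.0 - eta ** (t - t_tilde)   # strong suppression right after a visit
    return out
```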

4.5 Baselines and Evaluation Metrics

We introduced six baseline models. (1) Chance: A sequence of eye fixations is generated through uniform random sampling on a 16×16 grid, with a mouse click occurring at each fixated item with a 50% probability. (2) Visual Feature Modulation Only (FeatOnly): This method discards value modulations and predicts a sequence of eye fixations based solely on the items with the highest activation on the eccentricity-dependent similarity maps $M_F$ at $F_t$. The sequence of clicks directly corresponds to the sequence of eye fixations. (3) Max Value First (MaxVal): Eccentricity-dependent feature similarity maps $M_F$ are modulated by multiplying each map with the value of the corresponding target object in $V^{1:N}$. A sequence of eye fixations, equivalent to mouse clicks, is predicted by selecting the items with the highest values on the value-modulated $M_V$. (4) Average Value First (AvgVal): We compute the average of the value-modulated $M_V$ in MaxVal over all target objects. The items with the highest averaged values are fixated and collected by the model. (5) Deep-Q learning (DQN): Instead of the modular design of our VF with the transformer architecture, we use a classical 2D-CNN-based deep-Q network [89] and train it end-to-end. See Appendix Sec. S2.3 for implementation details. (6) Upper bound (UpperBound): An oracle model with perfect target localization and recognition abilities, capable of making globally optimal decisions by always selecting the target instances with the highest values in the search arrays and never clicking on a distractor.

All the baseline models use infinite inhibition of return to track previous fixation locations and prevent revisiting them. This is accomplished by either masking the visited item locations in the action probability distribution for DQN or by removing the clicked or fixated items from the search arrays for the other baselines introduced above.
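
For concreteness, a sketch of how the MaxVal and AvgVal baselines could choose their next fixation (and click) from the value-modulated maps is given below; the exact tie-breaking and array bookkeeping in our experiments may differ.

```python
import numpy as np

def maxval_next_pick(M_F, values):
    """MaxVal baseline (sketch): weight each similarity map by its target's
    value and pick the grid cell with the highest value-weighted activation."""
    M_V = M_F * np.asarray(values)[None, None, :]   # (H, W, N) value-modulated maps
    evidence = M_V.max(axis=-1)                     # best target evidence per cell
    return np.unravel_index(evidence.argmax(), evidence.shape)

def avgval_next_pick(M_F, values):
    """AvgVal baseline (sketch): average the value-modulated maps over
    targets before choosing the next fixation/click location."""
    M_V = M_F * np.asarray(values)[None, None, :]
    evidence = M_V.mean(axis=-1)
    return np.unravel_index(evidence.argmax(), evidence.shape)
```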

We propose two evaluation metrics. First, Normalized Score (Norm.Score) refers to the cumulative rewards as a function of the number of clicks within a foraging trial, normalized by the maximum score of UpperBound for that trial. The normalized score has an upper limit of 1, and a higher normalized score indicates that AI models or humans are more effective at making decisions to optimize cumulative rewards. Second, Click Bias Ratio (CBR) reflects the clicking biases of humans and AI models during foraging. We define the proportion of selections of target objects as Proportion Picked (PP) and the proportion of target objects left on the array as Proportion On-Screen (POS). CBR is then calculated as the normalized relative difference between PP and POS: $CBR=\frac{PP-POS}{PP+POS}$. CBR can change over the course of clicks within a trial. If neither humans nor AI models exhibit clicking preferences, PP will equal POS, resulting in a CBR of 0. If either agent prefers one target object over others, PP will be greater than POS, yielding a positive CBR. If an AI model aligns with human clicking bias patterns in a given experiment, their CBR values will share the same signs.
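
Both metrics are straightforward to compute per trial; a minimal sketch:

```python
def normalized_score(cumulative_reward, upper_bound_reward):
    """Norm.Score: cumulative reward normalized by the UpperBound oracle's
    maximum score for the same trial (Sec. 4.5)."""
    return cumulative_reward / upper_bound_reward

def click_bias_ratio(picked, on_screen):
    """CBR = (PP - POS) / (PP + POS) for one target type at a given click.

    `picked` (PP) is the proportion of selections spent on this target type,
    `on_screen` (POS) the proportion of remaining targets it accounts for.
    Positive CBR -> over-picking; negative -> under-picking; 0 -> no bias.
    """
    denom = picked + on_screen
    return 0.0 if denom == 0 else (picked - on_screen) / denom
```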

5 Results

Figure 4: Humans and AI models are reward-seeking agents. We report the normalized scores (Norm. Score) as a function of click numbers for humans (red), our VF model (blue), and other baseline models (varying gray). Chance is in black. Three experimental conditions of foraging trials are included with varying prevalence and values of target objects. See Sec. 4.5 for evaluation metrics and baselines, and Sec. 3.2 for experimental conditions.

5.1 Humans and AI models are reward-seeking

Humans and AI models are proficient foragers. We present the Norm.Score of humans and AI models under three conditions where target prevalence, value, or both vary, as shown in Fig. 4. Human subjects achieved Norm. Scores of 87.4%, 84.1%, and 93.1% across these conditions, significantly exceeding chance levels. Notably, even at the first mouse click, the Norm. Score was already well above chance. Similarly, all competitive baselines and our VF model consistently outperformed chance across all click numbers. These results suggest that both human subjects and AI models are effective at hybrid visual foraging, with their mouse click decisions strongly influenced by the values of target objects. However, humans never reached the upper bound of performance, indicating they are not perfect global optimizers. This may be due to imperfect object recognition [83], limited memory capacity [131], and foveated vision with restricted receptive fields [43].

Our VF model outperforms all the baseline models. The Norm. Scores of our VF model are 72.6%, 67.1%, and 81.6% across the three conditions, surpassing all baseline models. This indicates that our VF model effectively learns to optimize decision-making in foraging tasks and adapts well to variations in value and prevalence. Among the baseline models, FeatOnly achieved the highest Norm. Score, although it still performed worse than VF. This highlights the importance of visual feature modulation in foraging tasks, but also reveals that feature modulation alone is insufficient for optimal decision-making. While both AvgVal and MaxVal models incorporate value-based guidance into their decision strategies, their lower performance compared to FeatOnly suggests that simply relying on explicit values is ineffective. Rather, value and feature modulations must interact in a more sophisticated manner to achieve optimal performance.

5.2 Eye movements are affected by target values

Eye fixations tend to be drawn to regions associated with higher rewards. Eye movements can serve as a valuable lens for examining the decision-making process [51]. Here, we analyzed eye fixation locations and their correlation with rewards in the corresponding areas. In Fig. S4, we present the average rewards of all target objects within the fixated areas, defined as those falling within a radius of 1.5 degrees of visual angle around each fixation. Surprisingly, we found that both humans and VF tend to fixate on regions associated with average rewards of 3.31 and 8.08, significantly higher than the averages derived from random fixations. This indicates that target values guide fixations during decision-making for both humans and VF. Despite limited visual coverage due to foveation, both agents can effectively explore more rewarding areas. We also conducted the human fixation duration analysis. Remarkably, we found that humans tend to spend more time fixating on higher-value targets compared to those with lower values (Appendix Sec. S3.2).

5.3 Behavioral alignment between humans and VF

Figure 5: (A) Our VF model has consistent clicking biases with humans. Humans (red) and our VF models (blue) share the same signs of CBR for most targets under UnValEqPre (a) and UnValUnPre (b). Chance (gray) has no preferences over target objects; hence, a CBR of 0. (B) Our VF model approximates humans in saccade size distributions. Saccade size distributions for humans (red), our VF model (blue), and our VF model with eccentricity removed (light blue) are presented. Vertical dash lines in colors indicate their mean saccade sizes in visual angle degrees. (C) Our VF model can generalize to out-of-distribution hybrid foraging tasks. Spider plot shows Norm.Score for humans (red), our VF (blue) and FeatOnly baseline (black dotted) under 7 experimental conditions.

Both humans and AI models tend to overpick high-valued targets and underpick low-valued targets. In this analysis, we examine the items selected by humans and AI models from the search arrays and report their click bias ratios (CBR) averaged over all click numbers under the UnValEqPre and UnValUnPre conditions, as shown in Fig. 5A. Consistent with [135], we find that humans tend to overpick the highest-valued targets, as indicated by a positive CBR, while they underpick the lowest-valued targets, reflected by a negative CBR. Interestingly, our VF exhibits similar signs and magnitudes of CBR on these targets, suggesting that it displays analogous click biases based on target values as humans. However, both humans and VF do not exclusively select the highest-valued targets, as their Proportion Picked (PP) for less-valued targets is not exactly zero. This suggests that both humans and VF occasionally prioritize lower-valued targets during foraging (see Fig. S5 in Appendix). Moreover, we also noted a slight discrepancy in the signs of CBR when humans and VF select the second-highest-valued items. This indicates that VF is less sensitive to medium rewards compared to humans (also see Appendix Fig. S5).

VF approximates the saccade size distributions of humans, without training on human eye movements. Human saccade sizes are restricted by physiological limitations of the eye muscles and the necessity for accurate visual processing in the foveal region [43]. We aggregate all saccade sizes from humans and our VF model under 3 conditions and plot their distributions in Fig. 5B. Despite lacking prior training on human eye movements, VF yields a mean saccade size of 4.06 degrees, closely approximating the mean saccade size of 4.05 degrees for humans. Additionally, to investigate the mechanisms in VF that constrain saccade sizes, we present the distribution of saccade sizes with eccentricity-dependent layers replaced by standard max-pooling layers in the visual feature extractor. As expected, and in line with [49], the saccade sizes increase, suggesting that smaller saccade sizes are partially due to foveal processing.

5.4 VF generalizes to OOD conditions

To assess the generalization performance of our VF model in out-of-distribution (OOD) hybrid foraging tasks, we benchmark humans, VF, and the best baseline model, FeatOnly, across five OOD conditions (Sec. 3.2). From Fig. 5C, VF outperforms FeatOnly in all experimental conditions, indicating that it learns generic decision-making strategies and adapts well to unseen foraging scenarios. However, VF still lags behind humans, suggesting that humans can more flexibly adjust their foraging strategies to maximize rewards based on the environments.

5.5 Ablations reveal critical component designs

Ablations            UnValEqPre   UnValUnPre   EqValUnPre
Behavior Clone       61.7         48.5         60.1
VF (2D-CNN)          75.3         63.7         70.0
Explicit Val. Emb.   69.2         56.7         61.8
W/o Augmentation     51.3         52.0         52.2
Full VF (ours)       72.6         67.1         81.6
Table 1: Ablation studies reveal critical design choices of our VF model. Norm.Score for various ablated models are reported over UnValEqPre, UnValUnPre, and EqValUnPre conditions. See Sec. 5.5 for ablated models. Best is in bold.

We systematically ablated several essential components in our VF model and report the results in Tab. 1. (1) Rather than using reinforcement learning, we train VF on human eye movements and mouse clicks through supervised learning (Behavior Cloning). The lower Norm. Score of Behavior Cloning indicates that human eye movement data is limited, leading to model overfitting. Hence, the model struggles to generalize to unseen target value and prevalence combinations. (2) We replace the transformer-based decision-making module with a 2D-CNN, referred to as VF (2D-CNN). The lower Norm. Score of this ablated model indicates that the transformer architecture, with its ability to capture long-range dependencies and global context through self-attention, leads to better decision-making. (3) We ablate VF by replacing the learnable value encoder with explicit value embeddings and directly feeding them into the transformer (Explicit Val. Emb.). The drop in Norm. Score suggests that a learnable value embedding is more effective for making better decisions. (4) We remove the permutations of target and value pairs (W/o Augmentation), resulting in a significant drop in Norm. Score, especially under the EqValUnPre condition. This indicates that the data augmentation in VF is crucial for enhancing generalization to OOD hybrid foraging tasks.

6 Discussion

In hybrid visual foraging, humans are proficient foragers, directing attention to regions with significantly higher rewards than chance. When there is an imbalance in target prevalence or values, humans tend to over-exploit the most prevalent or high-valued target types. To explain these characteristics of human behavior in hybrid visual foraging, we propose a transformer-based Visual Forager (VF) trained with reinforcement learning. Its cumulative rewards are comparable to human performance and surpass those of other baseline models. Despite zero training on any human data, VF closely approximates human foraging behaviors, including foraging biases and eye movements. Remarkably, VF demonstrates exceptional generalization abilities, flexibly adjusting its foraging strategies to experimental conditions it has never encountered. Our work paves the way for several new research directions in psychology, neuroscience, and AI. We discuss these future directions in Appendix Sec. S4.

References

  • Abdolmaleki et al. [2018] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
  • Adeli and Zelinsky [2018] Hossein Adeli and Gregory Zelinsky. Deep-bcn: Deep networks meet biased competition to create a brain-inspired model of attention control. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 1932–1942, 2018.
  • Anderson and Yantis [2013] Brian A Anderson and Steven Yantis. Persistence of value-driven attentional capture. Journal of Experimental Psychology: Human Perception and Performance, 39(1):6, 2013.
  • Anderson et al. [2011] Brian A Anderson, Patryk A Laurent, and Steven Yantis. Value-driven attentional capture. Proceedings of the National Academy of Sciences, 108(25):10367–10371, 2011.
  • Andrychowicz et al. [2020] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  • Ang and Maus [2020] Jit Wei A Ang and Gerrit W Maus. Boosted visual performance after eye blinks. Journal of Vision, 20(10):2–2, 2020.
  • Arjona-Medina et al. [2019] Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems, 32, 2019.
  • Ba et al. [2014] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
  • Baee et al. [2021] Sonia Baee, Erfan Pakdamanian, Inki Kim, Lu Feng, Vicente Ordonez, and Laura Barnes. Medirl: Predicting the visual attention of drivers via maximum entropy deep inverse reinforcement learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13178–13188, 2021.
  • Balsdon et al. [2020] Tarryn Balsdon, Valentin Wyart, and Pascal Mamassian. Confidence controls perceptual evidence accumulation. Nature communications, 11(1):1753, 2020.
  • Bekkering et al. [1994] Harold Bekkering, Jos J Adam, Herman Kingma, A Huson, and HTA Whiting. Reaction time latencies of eye and hand movements in single-and dual-task conditions. Experimental brain research, 97:471–476, 1994.
  • Biederman et al. [1982] Irving Biederman, Robert J Mezzanotte, and Jan C Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive psychology, 14(2):143–177, 1982.
  • Borji et al. [2013] Ali Borji, Dicky N Sihite, and Laurent Itti. What/where to look next? modeling top-down visual attention in complex interactive environments. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(5):523–538, 2013.
  • Brady et al. [2008] Timothy F Brady, Talia Konkle, George A Alvarez, and Aude Oliva. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences, 105(38):14325–14329, 2008.
  • Cain et al. [2012] Matthew S Cain, Edward Vul, Kait Clark, and Stephen R Mitroff. A bayesian optimal foraging model of human visual search. Psychological science, 23(9):1047–1054, 2012.
  • Callaway et al. [2021] Frederick Callaway, Antonio Rangel, and Thomas L Griffiths. Fixation patterns in simple choice reflect optimal information sampling. PLoS computational biology, 17(3):e1008863, 2021.
  • Charnov [1976] Eric L Charnov. Optimal foraging, the marginal value theorem. Theoretical population biology, 9(2):129–136, 1976.
  • Chen et al. [2021] Xianyu Chen, Ming Jiang, and Qi Zhao. Predicting human scanpaths in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10876–10885, 2021.
  • Chen et al. [2024] Xianyu Chen, Ming Jiang, and Qi Zhao. Beyond average: Individualized visual scanpath prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25420–25431, 2024.
  • Chen et al. [2020] Ying Chen, Zhiyong Liu, Yongchao Zhang, Yuan Wu, Xin Chen, and Lian Zhao. Deep reinforcement learning-based dynamic resource management for mobile edge computing in industrial internet of things. IEEE Transactions on Industrial Informatics, 17(7):4925–4934, 2020.
  • Cho et al. [2018] Sun-Joo Cho, Sarah Brown-Schmidt, and Woo-yeol Lee. Autoregressive generalized linear mixed effect models with crossed random effects: An application to intensive binary time series eye-tracking data. Psychometrika, 83:751–771, 2018.
  • Chuk et al. [2020] Tim Chuk, Antoni B Chan, Shinsuke Shimojo, and Janet H Hsiao. Eye movement analysis with switching hidden markov models. Behavior research methods, 52:1026–1043, 2020.
  • Colizoli et al. [2018] Olympia Colizoli, Jan Willem de Gee, Anne E Urai, and Tobias H Donner. Task-evoked pupil responses reflect internal belief states. Scientific reports, 8(1):13702, 2018.
  • Colzato et al. [2009] Lorenza Serena Colzato, Wery PM Van Den Wildenberg, Nelleke C van Wouwe, Merel M Pannebakker, and Bernhard Hommel. Dopamine and inhibitory action control: evidence from spontaneous eye blink rates. Experimental brain research, 196:467–474, 2009.
  • Dai and Walter [2019] Falcon Dai and Matthew Walter. Maximum expected hitting cost of a markov decision process and informativeness of rewards. Advances in Neural Information Processing Systems, 32, 2019.
  • de Belen et al. [2022] Ryan Anthony Jalova de Belen, Tomasz Bednarz, and Arcot Sowmya. Scanpathnet: A recurrent mixture density network for scanpath prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5010–5020, 2022.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Devidze et al. [2022] Rati Devidze, Parameswaran Kamalaruban, and Adish Singla. Exploration-guided reward shaping for reinforcement learning under sparse rewards. Advances in Neural Information Processing Systems, 35:5829–5842, 2022.
  • DiCarlo et al. [2012] James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, 2012.
  • Ding et al. [2022] Zhiwei Ding, Xuezhe Ren, Erwan David, Melissa Vo, Gabriel Kreiman, and Mengmi Zhang. Efficient zero-shot visual search via target and context-aware transformer. arXiv preprint arXiv:2211.13470, 2022.
  • Donk and Theeuwes [2003] Mieke Donk and Jan Theeuwes. Prioritizing selection of new elements: Bottom-up versus top-down control. Perception & psychophysics, 65:1231–1242, 2003.
  • Dosovitskiy [2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Egeth et al. [1984] Howard E Egeth, Robert A Virzi, and Hadley Garbart. Searching for conjunctively defined targets. Journal of Experimental Psychology: Human Perception and Performance, 10(1):32, 1984.
  • Ehinger and Wolfe [2016] Krista A Ehinger and Jeremy M Wolfe. When is it time to move to the next map? optimal foraging in guided visual search. Attention, Perception, & Psychophysics, 78:2135–2151, 2016.
  • Engbert et al. [2015] Ralf Engbert, Hans A Trukenbrod, Simon Barthelmé, and Felix A Wichmann. Spatial statistics and attentional dynamics in scene viewing. Journal of vision, 15(1):14–14, 2015.
  • Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pages 1407–1416. PMLR, 2018.
  • Evdokimidis et al. [2002] I Evdokimidis, N Smyrnis, Theodoros Constantinidis, N Stefanis, D Avramopoulos, C Paximadis, C Theleritis, C Efstratiadis, G Kastrinakis, and C Stefanis. The antisaccade task in a sample of 2,006 young men: I. normal population characteristics. Experimental Brain Research, 147:45–52, 2002.
  • Eysenbach et al. [2022] Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Russ R Salakhutdinov. Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603–35620, 2022.
  • Fischer et al. [1997] Burkhart Fischer, Monica Biscaldi, and Stefan Gezeck. On the development of voluntary and reflexive components in human saccade generation. Brain research, 754(1-2):285–297, 1997.
  • Flechtner et al. [2002] Klaus-Malte Flechtner, Bruno Steinacher, Robert Sauer, and Arthur Mackert. Smooth pursuit eye movements of patients with schizophrenia and affective disorder during clinical treatment. European archives of psychiatry and clinical neuroscience, 252:49–53, 2002.
  • Fooken and Spering [2019] Jolande Fooken and Miriam Spering. Decoding go/no-go decisions from eye movements. Journal of vision, 19(2):5–5, 2019.
  • Foulsham and Underwood [2008] Tom Foulsham and Geoffrey Underwood. What can saliency models predict about eye movements? spatial and sequential aspects of fixations during encoding and recognition. Journal of vision, 8(2):6–6, 2008.
  • Freeman and Simoncelli [2011] Jeremy Freeman and Eero P Simoncelli. Metamers of the ventral stream. Nature neuroscience, 14(9):1195–1201, 2011.
  • Glaholt and Reingold [2011] Mackenzie G Glaholt and Eyal M Reingold. Eye movement monitoring as a process tracing methodology in decision making research. Journal of Neuroscience, Psychology, and Economics, 4(2):125, 2011.
  • Gluth et al. [2020] Sebastian Gluth, Nadja Kern, Maria Kortmann, and Cécile L Vitali. Value-based attention but not divisive normalization influences decisions with multiple alternatives. Nature human behaviour, 4(6):634–645, 2020.
  • Godlove and Schall [2016] David C Godlove and Jeffrey D Schall. Microsaccade production during saccade cancelation in a stop-signal task. Vision research, 118:5–16, 2016.
  • Goettker and Gegenfurtner [2021] Alexander Goettker and Karl R Gegenfurtner. A change in perspective: The interaction of saccadic and pursuit eye movements in oculomotor control and perception. Vision Research, 188:283–296, 2021.
  • Gong et al. [2024] Jiaqi Gong, Shengting Cao, Soroush Korivand, and Nader Jalili. Reconstructing human gaze behavior from eeg using inverse reinforcement learning. Smart Health, 32:100480, 2024.
  • Gupta et al. [2021] Shashi Kant Gupta, Mengmi Zhang, Chia-Chien Wu, Jeremy Wolfe, and Gabriel Kreiman. Visual search asymmetry: Deep nets and humans share similar inherent biases. Advances in neural information processing systems, 34:6946–6959, 2021.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018.
  • Hamker [2005] Fred H Hamker. The reentry hypothesis: the putative interaction of the frontal eye field, ventrolateral prefrontal cortex, and areas v4, it for attention and eye movement. Cerebral cortex, 15(4):431–447, 2005.
  • Hansen et al. [2021] Nicklas Hansen, Hao Su, and Xiaolong Wang. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in neural information processing systems, 34:3680–3693, 2021.
  • Hansen et al. [2024] Nicklas Hansen, Jyothir SV, Vlad Sobal, Yann LeCun, Xiaolong Wang, and Hao Su. Hierarchical world models as visual whole-body humanoid controllers. arXiv preprint arXiv:2405.18418, 2024.
  • Henderson [1992] John M Henderson. Object identification in context: the visual processing of natural scenes. Canadian Journal of Psychology/Revue canadienne de psychologie, 46(3):319, 1992.
  • Hu et al. [2011] Frank K Hu, Arthur G Samuel, and Agnes S Chan. Eliminating inhibition of return by changing salient nonspatial attributes in a complex environment. Journal of Experimental Psychology: General, 140(1):35, 2011.
  • Hu et al. [2017] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 804–813, 2017.
  • Itti and Koch [2000] Laurent Itti and Christof Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision research, 40(10-12):1489–1506, 2000.
  • Jiang et al. [2020] Ming Jiang, Shi Chen, Jinhui Yang, and Qi Zhao. Fantastic answers and where to find them: Immersive question-directed visual attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2980–2989, 2020.
  • Joo et al. [2016] Sung Jun Joo, Leor N Katz, and Alexander C Huk. Decision-related perturbations of decision-irrelevant eye movements. Proceedings of the National Academy of Sciences, 113(7):1925–1930, 2016.
  • Kawaguchi et al. [2018] Katsuhisa Kawaguchi, Stephane Clery, Paria Pourriahi, Lenka Seillier, Ralf M Haefner, and Hendrikje Nienborg. Differentiating between models of perceptual decision making using pupil size inferred confidence. Journal of Neuroscience, 38(41):8874–8888, 2018.
  • Kingma [2014] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kornylo et al. [2003] Krista Kornylo, Natalie Dill, Melissa Saenz, and Richard J Krauzlis. Canceling of pursuit and saccadic eye movements in humans and monkeys. Journal of Neurophysiology, 89(6):2984–2999, 2003.
  • Kovacs and Julesz [1993] Ilona Kovacs and Bela Julesz. A closed curve is much more than an incomplete one: effect of closure in figure-ground segmentation. Proceedings of the National Academy of Sciences, 90(16):7495–7497, 1993.
  • Krajbich and Rangel [2011] Ian Krajbich and Antonio Rangel. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences, 108(33):13852–13857, 2011.
  • Krajbich et al. [2010] Ian Krajbich, Carrie Armel, and Antonio Rangel. Visual fixations and the computation and comparison of value in simple choice. Nature neuroscience, 13(10):1292–1298, 2010.
  • Krauzlis [2005] Richard J Krauzlis. The control of voluntary eye movements: new perspectives. The Neuroscientist, 11(2):124–137, 2005.
  • Kristjánsson and Driver [2008] Árni Kristjánsson and Jon Driver. Priming in visual search: Separating the effects of target repetition, distractor repetition and role-reversal. Vision Research, 48(10):1217–1232, 2008.
  • Krupinski [2010] Elizabeth A Krupinski. Current perspectives in medical image perception. Attention, Perception, & Psychophysics, 72(5):1205–1217, 2010.
  • Laud and DeJong [2003] Adam Laud and Gerald DeJong. The influence of reward on the speed of reinforcement learning: An analysis of shaping. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 440–447, 2003.
  • Lee et al. [2023] Seungjae Lee, Daesol Cho, Jonghae Park, and H Jin Kim. Cqm: Curriculum reinforcement learning with a quantized world model. Advances in Neural Information Processing Systems, 36:78824–78845, 2023.
  • Lee and Mumford [2003] Tai Sing Lee and David Mumford. Hierarchical bayesian inference in the visual cortex. JOSA a, 20(7):1434–1448, 2003.
  • Leigh and Zee [2015] R John Leigh and David S Zee. The neurology of eye movements. Oxford University Press, USA, 2015.
  • Lencer et al. [2003] Rebekka Lencer, Katja Trillenberg-Krecker, Eberhard Schwinger, and Volker Arolt. Schizophrenia spectrum disorders and eye tracking dysfunction in singleton and multiplex schizophrenia families. Schizophrenia Research, 60(1):33–45, 2003.
  • Liechty et al. [2003] John Liechty, Rik Pieters, and Michel Wedel. Global and local covert visual attention: Evidence from a bayesian hidden markov model. Psychometrika, 68:519–541, 2003.
  • Lin et al. [2020] Zhongqiao Lin, Chechang Nie, Yuanfeng Zhang, Yang Chen, and Tianming Yang. Evidence accumulation for value computation in the prefrontal cortex during decision making. Proceedings of the National Academy of Sciences, 117(48):30728–30737, 2020.
  • Lisberger [2015] Stephen G Lisberger. Visual guidance of smooth pursuit eye movements. Annual review of vision science, 1(1):447–468, 2015.
  • Liu et al. [2023] Yanjun Liu, Jeremy M Wolfe, and Jennifer Trueblood. The impact of risk and prevalence on foraging behavior in hybrid visual search. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2023.
  • Ma et al. [2023] Weiyu Ma, Qirui Mi, Xue Yan, Yuqiao Wu, Runji Lin, Haifeng Zhang, and Jun Wang. Large language models play starcraft ii: Benchmarks and a chain of summarization approach. arXiv preprint arXiv:2312.11865, 2023.
  • MacLean and Giesbrecht [2015] Mary H MacLean and Barry Giesbrecht. Irrelevant reward and selection histories have different influences on task-relevant attentional selection. Attention, Perception, & Psychophysics, 77:1515–1528, 2015.
  • Mao et al. [2016] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM workshop on hot topics in networks, pages 50–56, 2016.
  • Maunsell and Treue [2006] John HR Maunsell and Stefan Treue. Feature-based attention in visual cortex. Trends in neurosciences, 29(6):317–322, 2006.
  • McSorley and McCloy [2009] Eugene McSorley and Rachel McCloy. Saccadic eye movements as an index of perceptual decision-making. Experimental brain research, 198:513–520, 2009.
  • Miconi et al. [2015] Thomas Miconi, Laura Groomes, and Gabriel Kreiman. There’s waldo! a normalization model of visual search predicts single-trial human fixations in an object search task. Cerebral cortex, 26(7):3064–3082, 2015.
  • Mirpour and Bisley [2021] Koorosh Mirpour and James W Bisley. The roles of the lateral intraparietal area and frontal eye field in guiding eye movements in free viewing search behavior. Journal of Neurophysiology, 2021.
  • Missal and Heinen [2017] Marcus Missal and Stephen J Heinen. Stopping smooth pursuit. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1718):20160200, 2017.
  • Mnih [2013] Volodymyr Mnih. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih [2016] Volodymyr Mnih. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
  • Mnih et al. [2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. Advances in neural information processing systems, 27, 2014.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mondal et al. [2023] Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory Zelinsky, and Minh Hoai. Gazeformer: Scalable, effective and fast prediction of goal-directed human attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1441–1450, 2023.
  • Mormann and Russo [2021] Milica Mormann and J Edward Russo. Does attention increase the value of choice alternatives? Trends in cognitive sciences, 25(4):305–315, 2021.
  • Munoz and Coe [2011] Douglas P Munoz and Brian C Coe. Saccade, search and orient–the neural control of saccadic eye movements, 2011.
  • Murphy et al. [2021] Peter R Murphy, Niklas Wilming, Diana C Hernandez-Bocanegra, Genis Prat-Ortega, and Tobias H Donner. Adaptive circuit dynamics across human cortex during evidence accumulation in changing environments. Nature neuroscience, 24(7):987–997, 2021.
  • Narvekar [2017] Sanmit Narvekar. Curriculum learning in reinforcement learning. In IJCAI, pages 5195–5196, 2017.
  • Narvekar et al. [2020] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181):1–50, 2020.
  • Pardyl et al. [2024] Adam Pardyl, Michał Wronka, Maciej Wołczyk, Kamil Adamczewski, Tomasz Trzciński, and Bartosz Zieliński. Adaglimpse: Active visual exploration with arbitrary glimpse position and scale. arXiv preprint arXiv:2404.03482, 2024.
  • Plank and James [2008] MJ Plank and A James. Optimal foraging: Lévy pattern or process? Journal of the Royal Society Interface, 5(26):1077–1086, 2008.
  • Radant and Hommer [1992] Allen D Radant and Daniel W Hommer. A quantitative analysis of saccades and smooth pursuit during visual pursuit tracking: A comparison of schizophrenics with normals and substance abusing controls. Schizophrenia research, 6(3):225–235, 1992.
  • Rafailov et al. [2024] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • Rolfs [2009] Martin Rolfs. Microsaccades: small steps on a long way. Vision research, 49(20):2415–2441, 2009.
  • Ross et al. [2001] Randal G Ross, Ann Olincy, Gary Zerbe, and Allen Radant. Which duration of postsaccadic slowing identifies anticipatory saccades during smooth pursuit eye movements? Psychophysiology, 38(2):325–333, 2001.
  • Roy-Byrne et al. [1995] Peter Roy-Byrne, Allen Radant, Dane Wingerson, and Deborah S Cowley. Human oculomotor function: reliability and diurnal variation. Biological Psychiatry, 38(2):92–97, 1995.
  • Sauter et al. [2021] Marian Sauter, Nina M Hanning, Heinrich R Liesefeld, and Hermann J Müller. Post-capture processes contribute to statistical learning of distractor locations in visual search. Cortex, 135:108–126, 2021.
  • Savva et al. [2019] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019.
  • Schulman et al. [2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Seideman et al. [2018] Joshua A Seideman, Terrence R Stanford, and Emilio Salinas. Saccade metrics reflect decision-making dynamics during urgent choices. Nature communications, 9(1):2907, 2018.
  • Seo et al. [2023] Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023.
  • Shadmehr et al. [2019] Reza Shadmehr, Thomas R Reppert, Erik M Summerside, Tehrim Yoon, and Alaa A Ahmed. Movement vigor as a reflection of subjective economic utility. Trends in neurosciences, 42(5):323–336, 2019.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Smith and Henderson [2011] Tim J Smith and John M Henderson. Does oculomotor inhibition of return influence fixation probability during scene search? Attention, Perception, & Psychophysics, 73:2384–2398, 2011.
  • Smyrnis et al. [2002] N Smyrnis, I Evdokimidis, N Stefanis, T Constantinidis, D Avramopoulos, C Theleritis, C Paximadis, C Efstratiadis, G Kastrinakis, and C Stefanis. The antisaccade task in a sample of 2,006 young males: II. Effects of task parameters. Experimental brain research, 147:53–63, 2002.
  • Sommer and Wurtz [2004] Marc A Sommer and Robert H Wurtz. What the brain stem tells the frontal cortex. I. Oculomotor signals sent from superior colliculus to frontal eye field via mediodorsal thalamus. Journal of neurophysiology, 91(3):1381–1402, 2004.
  • Song et al. [2019] Mingyu Song, Xingyu Wang, Hang Zhang, and Jian Li. Proactive information sampling in value-based decision-making: Deciding when and where to saccade. Frontiers in human neuroscience, 13:35, 2019.
  • Spering [2022] Miriam Spering. Eye movements as a window into decision-making. Annual review of vision science, 8(1):427–448, 2022.
  • Stanford and Salinas [2021] Terrence R Stanford and Emilio Salinas. Urgent decision making: resolving visuomotor interactions at high temporal resolution. Annual Review of Vision Science, 7(1):323–348, 2021.
  • Stüttgen et al. [2012] Peter Stüttgen, Peter Boatwright, and Robert T Monroe. A satisficing choice model. Marketing Science, 31(6):878–899, 2012.
  • Sui et al. [2023] Xiangjie Sui, Yuming Fang, Hanwei Zhu, Shiqi Wang, and Zhou Wang. Scandmm: A deep markov model of scanpath prediction for 360deg images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6989–6999, 2023.
  • ’t Hart et al. [2013] Bernard Marius ’t Hart, Hannah Claudia Elfriede Fanny Schmidt, Ingo Klein-Harmeyer, and Wolfgang Einhäuser. Attention in natural scenes: contrast affects rapid visual processing and fixations alike. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1628):20130067, 2013.
  • Taylor and Badcock [1988] Steven Taylor and David Badcock. Processing feature density in preattentive perception. Perception & psychophysics, 44:551–562, 1988.
  • Thura et al. [2014] David Thura, Ignasi Cos, Jessica Trung, and Paul Cisek. Context-dependent urgency influences speed–accuracy trade-offs in decision-making and movement execution. Journal of Neuroscience, 34(49):16442–16454, 2014.
  • Toole and Fogt [2021] Andrew J Toole and Nick Fogt. Head and eye movements and gaze tracking in baseball batting. Optometry and Vision Science, 98(7):750–758, 2021.
  • Torralba et al. [2006] Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological review, 113(4):766, 2006.
  • Towal et al. [2013] R Blythe Towal, Milica Mormann, and Christof Koch. Simultaneous modeling of visual saliency and value computation improves predictions of economic choice. Proceedings of the National Academy of Sciences, 110(40):E3858–E3867, 2013.
  • Van der Lans et al. [2008] Ralf Van der Lans, Rik Pieters, and Michel Wedel. Eye-movement analysis of search effectiveness. Journal of the American Statistical Association, 103(482):452–461, 2008.
  • Van Dyck et al. [2021] Leonard Elia Van Dyck, Roland Kwitt, Sebastian Jochen Denzler, and Walter Roland Gruber. Comparing object recognition in humans and deep convolutional neural networks—an eye tracking study. Frontiers in Neuroscience, 15:750639, 2021.
  • Van Slooten et al. [2019] Joanne C Van Slooten, Sara Jahfari, and Jan Theeuwes. Spontaneous eye blink rate predicts individual differences in exploration and exploitation during reinforcement learning. Scientific reports, 9(1):17436, 2019.
  • Watson and Humphreys [1997] Derrick G Watson and Glyn W Humphreys. Visual marking: prioritizing selection for new objects by top-down attentional inhibition of old objects. Psychological review, 104(1):90, 1997.
  • Wolfe et al. [2015] Jeremy Wolfe, Matthew Cain, Krista Ehinger, and Trafton Drew. Guided search 5.0: Meeting the challenge of hybrid search and multiple-target foraging. Journal of vision, 15(12):1106–1106, 2015.
  • Wolfe [2012] Jeremy M Wolfe. Saved by a log: How do humans perform hybrid visual and memory search? Psychological Science, 23(7):698–703, 2012.
  • Wolfe and DiMase [2003] Jeremy M Wolfe and Jennifer S DiMase. Do intersections serve as basic features in visual search? Perception, 32(6):645–656, 2003.
  • Wolfe and Horowitz [2017] Jeremy M Wolfe and Todd S Horowitz. Five factors that guide attention in visual search. Nature human behaviour, 1(3):0058, 2017.
  • Wolfe et al. [2016] Jeremy M Wolfe, Avigael M Aizenman, Sage EP Boettcher, and Matthew S Cain. Hybrid foraging search: Searching for multiple instances of multiple types of target. Vision research, 119:50–59, 2016.
  • Wolfe et al. [2018] Jeremy M Wolfe, Matthew S Cain, and Abla Alaoui-Soce. Hybrid value foraging: How the value of targets shapes human foraging behavior. Attention, Perception, & Psychophysics, 80:609–621, 2018.
  • Wolfe et al. [2019] Jeremy M Wolfe, Matthew S Cain, and Avigael M Aizenman. Guidance and selection history in hybrid foraging visual search. Attention, Perception, & Psychophysics, 81:637–653, 2019.
  • Wu and Wolfe [2019] Chia-Chien Wu and Jeremy M Wolfe. Eye movements in medical image perception: a selective review of past, present and future. Vision, 3(2):32, 2019.
  • Yang et al. [2020] Zhibo Yang, Lihan Huang, Yupei Chen, Zijun Wei, Seoyoung Ahn, Gregory Zelinsky, Dimitris Samaras, and Minh Hoai. Predicting goal-directed human attention using inverse reinforcement learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 193–202, 2020.
  • Yang et al. [2024] Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Ruoyu Xue, Gregory Zelinsky, Minh Hoai, and Dimitris Samaras. Unifying top-down and bottom-up scanpath prediction using transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1683–1693, 2024.
  • Yoon et al. [2018] Tehrim Yoon, Robert B Geary, Alaa A Ahmed, and Reza Shadmehr. Control of movement vigor and decision making during foraging. Proceedings of the National Academy of Sciences, 115(44):E10476–E10485, 2018.
  • Zhang et al. [2017] Jinxia Zhang, Xue Gong, Daryl Fougnie, and Jeremy M Wolfe. How humans react to changing rewards during visual foraging. Attention, Perception, & Psychophysics, 79:2299–2309, 2017.
  • Zhang et al. [2018] Mengmi Zhang, Jiashi Feng, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, and Gabriel Kreiman. Finding any waldo with zero-shot invariant and efficient visual search. Nature communications, 9(1):3730, 2018.
  • Zhang et al. [2022] Mengmi Zhang, Marcelo Armendariz, Will Xiao, Olivia Rose, Katarina Bendtz, Margaret Livingstone, Carlos Ponce, and Gabriel Kreiman. Look twice: A generalist computational model predicts return fixations across tasks and species. PLoS computational biology, 18(11):e1010654, 2022.
  • Zhou et al. [2024] Guangyao Zhou, Wenhong Tian, Rajkumar Buyya, Ruini Xue, and Liang Song. Deep reinforcement learning-based methods for resource scheduling in cloud computing: A review and future directions. Artificial Intelligence Review, 57(5):124, 2024.

Supplementary Material for
\thetitle

S1 Implementation details of Hybrid Visual Foraging

S1.1 Human psychophysics experiments

The search grid contained 90, 105, or 120 items, and the positions of these items were shuffled every 3 seconds to discourage a fixed reading strategy, such as scanning from the top left to the bottom right of the screen.

Each experiment consisted of 10 blocks. The target objects and their values remained constant across trials within a block, but the prevalence of targets and the number of target objects could vary across trials within the block. In each foraging trial, subjects searched for $N \in \{1, 2, 4\}$ target objects, each with a varying number of target instances. Targets and distractors in the hybrid foraging search arrays were randomly selected from a pool of 2,400 unique items used in [14]. The order of the blocks was counterbalanced across subjects.

Each experiment took approximately 1 hour to complete. A total of 15 subjects were recruited, yielding 750 trials with 50,514 eye fixations and 12,851 mouse clicks. All experiments were conducted with the subjects' informed consent and in accordance with protocols approved by the Institutional Review Board of our institution. Each subject was compensated with a monetary reward.

S1.2 Foraging environments for AI models

We sub-sample various combinations of these experimental parameters for the procedural generation of foraging environments. First, the total number of items on the search array is fixed at 105, of which 73 serve as distractors and 32 are designated as target instances. Second, a fixed set of 4 items is randomly selected from the pool of 2,400 items and used as the target set throughout AI model training. Third, 4 target objects are always present on the search arrays. Fourth, the prevalence ratio among these 4 target items is randomly determined. Finally, the values of the four target items are consistently set at 2, 4, 8, and 12.
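For concreteness, the sketch below illustrates how one such environment configuration could be procedurally sampled. The function and field names are hypothetical, and the logic is only a minimal approximation of the actual environment generator.

import random

# Illustrative sketch of procedural environment generation (names are hypothetical).
ITEM_POOL_SIZE = 2400          # unique items in the stimulus pool
TARGET_VALUES = (2, 4, 8, 12)  # fixed values assigned to the four target items

def sample_training_environment(target_ids, rng=random):
    """Sample one foraging environment under the training distribution."""
    assert len(target_ids) == 4, "training always uses a fixed set of 4 target items"
    num_items, num_targets = 105, 32          # 73 distractors + 32 target instances
    # Random prevalence ratio over the 4 target items (normalized to sum to 1).
    weights = [rng.random() for _ in range(4)]
    prevalence = [w / sum(weights) for w in weights]
    # Assign each of the 32 target instances to one of the 4 target items.
    instances = rng.choices(target_ids, weights=prevalence, k=num_targets)
    distractor_pool = [i for i in range(ITEM_POOL_SIZE) if i not in target_ids]
    distractors = rng.sample(distractor_pool, num_items - num_targets)
    return {
        "targets": dict(zip(target_ids, TARGET_VALUES)),
        "target_instances": instances,
        "distractors": distractors,
        "prevalence": prevalence,
    }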

S1.3 In-domain and out-of-domain test conditions for AI models

To benchmark AI model performance in hybrid foraging tasks, we introduce two in-domain hybrid foraging conditions that match the distribution of the training environments the AI models were optimized to solve. To assess whether the AI models generalize to out-of-distribution (OOD) hybrid visual search tasks, where experimental parameters differ from those encountered during training, we introduce five OOD conditions. All seven conditions are summarized below (see also the configuration sketch after this list):
(1) In-domain Uneven Value, Equal Prevalence (UnValEqPre): The prevalence of all four targets was set at 25%, while their values varied: one target was worth 2, another 4, a third 8, and the fourth 16.
(2) In-domain Uneven Value, Unequal Prevalence (UnValUnPre): The first target had a value of 2 with 53% prevalence, the second a value of 4 with 27%, the third a value of 8 with 13%, and the fourth a value of 16 with 7%.
(3) OOD - Equal Value, Unequal Prevalence (EqValUnPre): Each of the four target objects had a value of 8, but their prevalence varied: 53%, 27%, 13%, and 7%, respectively.
(4) OOD - Unseen target objects (UTargets): We replaced the target and distractor objects from the pool of 2,400 items used for training with unseen items, while keeping all other experimental parameters unchanged.
(5) OOD - Unseen value combinations (UValues): The prevalence of all four targets was randomized, and their absolute values exceeded the range used for training, with their relative values forming either arithmetic or geometric series. Specifically, the value combinations were (1, 2, 3, 4), (1, 2, 4, 8), (8, 9, 10, 11), (8, 16, 32, 64), (16, 18, 20, 22), and (16, 32, 64, 128).
(6) OOD - Unseen total item numbers (UItemNum): Unlike during training, when the total number of items on the screen was consistently 120, the search arrays were populated with either 90 or 105 items.
(7) OOD - Unseen target set sizes (USetSize): The set size of target objects was reduced to either one or two; the single target object was valued at 4, while the two target objects were valued at 4 and 16.
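For readability, the condition parameters above can also be written as a small configuration table. The sketch below is an illustrative encoding only (keys and structure are hypothetical), not the evaluation code itself.

# Illustrative encoding of the seven evaluation conditions; values and
# prevalences are taken from the descriptions above, keys are hypothetical.
CONDITIONS = {
    # In-domain
    "UnValEqPre": {"values": (2, 4, 8, 16), "prevalence": (0.25, 0.25, 0.25, 0.25)},
    "UnValUnPre": {"values": (2, 4, 8, 16), "prevalence": (0.53, 0.27, 0.13, 0.07)},
    # Out-of-distribution
    "EqValUnPre": {"values": (8, 8, 8, 8), "prevalence": (0.53, 0.27, 0.13, 0.07)},
    "UTargets":   {"note": "unseen target/distractor items; other parameters unchanged"},
    "UValues":    {"value_sets": [(1, 2, 3, 4), (1, 2, 4, 8), (8, 9, 10, 11),
                                  (8, 16, 32, 64), (16, 18, 20, 22), (16, 32, 64, 128)],
                   "prevalence": "randomized"},
    "UItemNum":   {"total_items": (90, 105)},
    "USetSize":   {"set_sizes": (1, 2), "values": {1: (4,), 2: (4, 16)}},
}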

S2 Reinforcement learning

We recall the Markov decision process (MDP) framework with finite state space $\mathcal{S}$ and action space $\mathcal{A}$. An MDP is defined as $\mathcal{M}=(\mathcal{S},\mathcal{A},Pr,r,\gamma)$, where $Pr:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ is the transition function, $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in(0,1)$ is the discount factor. Given an initial state $s_{0}$, the goal of reinforcement learning (RL) is to learn a policy $\pi$ that maps a state $s\in\mathcal{S}$ to a distribution $\pi(\cdot\mid s)$ over the action space, aiming to maximize the expected cumulative discounted reward.

For any policy $\pi$, the action-value function $Q^{\pi}(s,a)$ represents the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$. It is defined as $Q^{\pi}(s,a)=\mathbb{E}_{\pi,Pr}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\mid s_{0}=s,a_{0}=a\right]$, where $\mathbb{E}_{\pi,Pr}(\cdot)$ denotes the expectation over trajectories generated by following $\pi$ under the transition dynamics $Pr$. The state-value function $V_{s}^{\pi}(s)$ is the expected return starting from state $s$ and following $\pi$, while the advantage function $A^{\pi}(s,a)=Q^{\pi}(s,a)-V_{s}^{\pi}(s)$ quantifies the relative benefit of taking action $a$ in state $s$ under policy $\pi$.
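As a minimal illustration of these definitions, the sketch below computes Monte-Carlo returns and the corresponding advantage estimates from a single recorded trajectory. It is only didactic; the actual training relies on generalized advantage estimation (GAE, see Table S1) rather than this simple estimator.

from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards over one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def advantages(rewards: List[float], values: List[float], gamma: float = 0.99) -> List[float]:
    """A_t = G_t - V(s_t): how much better the taken actions were than the value baseline."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]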

S2.1 Proximal Policy Optimization

Proximal Policy Optimization (PPO) [106] is a policy gradient method designed to improve the stability and efficiency of policy updates. PPO uses a surrogate objective function with a clipping mechanism to prevent large, destabilizing updates. The surrogate objective is defined as

$L^{\text{CLIP}}=\mathbb{E}_{\pi_{\theta_{\text{old}}}}\left[\min\left(\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,\tilde{A}^{\pi_{\theta_{\text{old}}}}(s,a),\ \operatorname{clip}\left(\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)},\,1-\delta_{0},\,1+\delta_{0}\right)\tilde{A}^{\pi_{\theta_{\text{old}}}}(s,a)\right)\right],$

where $\tilde{A}^{\pi_{\theta_{\text{old}}}}$ is an estimate of the advantage function under the old policy, and $\delta_{0}$ is a hyperparameter controlling the extent of clipping.

In this formulation, the first term inside the $\min$ operator is the standard policy gradient objective, while the second term applies the clipping mechanism to ensure that the policy update does not result in excessively large changes. This clipping is crucial for maintaining the stability of the learning process.
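A minimal PyTorch-style sketch of this clipped surrogate loss is given below, assuming per-action log-probabilities and advantage estimates have already been computed; it mirrors the equation above but is not the exact training code.

import torch

def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_range: float = 0.05) -> torch.Tensor:
    """Negative clipped surrogate objective (to be minimized)."""
    ratio = torch.exp(log_prob_new - log_prob_old)             # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize L^CLIP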

S2.2 Additional training and implementation details

In practice, rather than learning two separate policies for actions taken at different times, i.e., the mouse click at $t$ and the fixation at $t+1$, we modify the click policy $\pi_{c}(\cdot\mid s)$ to output the binary click decision at $t+1$, aligning it with the fixation policy. Empirically, this leads to more efficient training and faster convergence. Importantly, this modification does not alter the hybrid foraging setup, as VF can fixate on the same grid cell consecutively. In other words, VF may initially decide not to click the item fixated at $t+1$ but can later decide to click it by fixating on the same item again at the next time step.

The search image $I_{S}$ has a resolution of $1024\times 1024$ pixels, while the target images $I_{T}$ are $64\times 64$ pixels, corresponding to the size of one cell within the $16\times 16$ grid over $I_{S}$. The search feature map $\phi_{S}$ has dimensions $32\times 32\times 512$, while the target feature maps $\phi_{T}^{1:N}$ are $2\times 2\times 512$. We implemented the target modulation function $\mathcal{M}$ with a stride of 2, resulting in $M_{F}$ with dimensions $16\times 16\times N$, where the spatial size matches the grid size of the search image.
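The shape bookkeeping can be checked with a short sketch: if each $2\times 2\times 512$ target feature map is slid over the $32\times 32\times 512$ search feature map as a correlation kernel with stride 2, the output is a $16\times 16$ map per target. This is only a shape check under that assumption, not the exact definition of $\mathcal{M}$.

import torch
import torch.nn.functional as F

N = 4                                   # number of target items
phi_S = torch.randn(1, 512, 32, 32)     # search feature map, with a batch dimension
phi_T = torch.randn(N, 512, 2, 2)       # one 2x2x512 feature map per target

# Cross-correlate each target feature map with the search features (stride 2).
# Each target acts as one output channel, so the result has shape (1, N, 16, 16).
M_F = F.conv2d(phi_S, phi_T, stride=2)
print(M_F.shape)                        # torch.Size([1, 4, 16, 16])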

Our VF was trained over 3 million timesteps in the first stage, taking approximately 3 days, and over 0.6 million timesteps in the second stage, taking approximately 1 day. All training was conducted on a single NVIDIA RTX A6000 GPU.

S2.3 Deep Q-learning

Value-based reinforcement learning methods solve the MDP by estimating an optimal value function, defined as $V_{s}^{*}(s)=\sup_{\pi}V_{s}^{\pi}(s)$ and, similarly, $Q^{*}(s,a)=\sup_{\pi}Q^{\pi}(s,a)$. We use deep Q-learning (DQN) as a baseline method, which approximates $Q^{*}$ via the update $Q_{i+1}(s_{t},a_{t})=(1-\alpha_{t})Q_{i}(s_{t},a_{t})+\alpha_{t}\left(r_{t}+\gamma\max_{a}Q_{i}(s_{t+1},a)\right)$, where $\alpha_{t}\in(0,1)$ is the learning rate. We employ an $\varepsilon$-greedy policy for action selection: we pick $\arg\max_{a}Q_{i}(s,a)$ with probability $1-\varepsilon$ and a random action with probability $\varepsilon$.
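The update rule and the $\varepsilon$-greedy selection can be written compactly as in the didactic sketch below; the actual baseline replaces the table with a neural network approximator, as described next.

import random
from collections import defaultdict

Q = defaultdict(float)                      # Q[(state, action)] -> value estimate

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q_{i+1}(s,a) = (1 - alpha) Q_i(s,a) + alpha (r + gamma * max_a' Q_i(s', a'))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target

def epsilon_greedy(s, actions, eps=0.1):
    """Pick argmax_a Q(s, a) with probability 1 - eps, a random action otherwise."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])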

For this baseline, we do not incorporate target feature modulation or target value modulation. Instead, we designed a deep neural network (DNN) that predicts the value function in an end-to-end fashion. The DNN takes as input a search image, the target images, and the target values. It uses two 2D CNNs to extract features from the search and target images, respectively, concatenates the search features, target features, and target values, and passes them through an MLP with three fully connected blocks that outputs an approximate state-action value for each action. Following the standard DQN of [89], our implementation incorporates the key techniques of target networks and experience replay.
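A hedged sketch of this baseline architecture is shown below; the layer widths and the action-space size are illustrative assumptions, since the text above specifies only the overall structure (two CNN encoders, concatenation with target values, and a three-block MLP head).

import torch
import torch.nn as nn

class DQNBaseline(nn.Module):
    """Sketch of the end-to-end Q-network baseline (layer sizes are illustrative)."""
    def __init__(self, num_targets: int = 4, num_actions: int = 16 * 16 * 2, feat_dim: int = 256):
        # num_actions is an assumption: e.g., one action per grid cell and click decision.
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim), nn.ReLU())
        self.search_enc = encoder()   # 2D CNN over the search image
        self.target_enc = encoder()   # 2D CNN over each target image
        in_dim = feat_dim + num_targets * feat_dim + num_targets
        self.head = nn.Sequential(    # three fully connected blocks -> Q-values per action
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_actions))

    def forward(self, search_img, target_imgs, target_values):
        # search_img: (B, 3, H, W); target_imgs: (B, N, 3, h, w); target_values: (B, N)
        b, n = target_imgs.shape[:2]
        s = self.search_enc(search_img)                              # (B, feat_dim)
        t = self.target_enc(target_imgs.flatten(0, 1)).view(b, -1)   # (B, N * feat_dim)
        return self.head(torch.cat([s, t, target_values], dim=-1))   # (B, num_actions)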

S2.4 PPO Hyperparameters

The hyperparameters used in the two training stages are listed in Table S1 below:

Hyperparameter                    Stage 1    Stage 2
Discount ($\gamma$)               0.99       0.99
GAE parameter ($\lambda$) [105]   0.95       0.95
Batch size                        512        512
Epochs                            5          1
PPO clip range                    0.05       0.05
Entropy coefficient [87]          0          0.001
Learning rate                     2e-4       2e-4

Table S1: PPO hyperparameters for the two training stages.
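For reference, if one were to reproduce the Stage 1 settings with an off-the-shelf implementation such as Stable-Baselines3 (not necessarily the implementation used here), the hyperparameters in Table S1 would map onto its arguments roughly as follows; the environment below is only a stand-in placeholder for the custom foraging environment.

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")            # placeholder for the foraging environment
model = PPO(
    "MlpPolicy", env,
    learning_rate=2e-4, gamma=0.99, gae_lambda=0.95,
    batch_size=512, n_epochs=5, clip_range=0.05, ent_coef=0.0,
)
model.learn(total_timesteps=10_000)      # Stage 1 in the paper ran for ~3M timesteps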

S3 Additional experiment results

S3.1 Human motor response

Refer to caption
Figure S1: Human reaction time in a trial as a function of the number of clicks. We recorded clicks across all subjects' trials and show the linear fit.
Refer to caption
Figure S2: Human response time in a trial as a function of the number of fixations. We recorded fixations across all subjects' trials and show the linear fit.

S3.2 Human fixation duration

Fixation durations are longer on targets with higher values. We also investigated human fixation durations on targets with varying values under the UnValEqPre and UnValUnPre conditions. As shown in Fig. S3, we found, surprisingly, that humans tend to spend more time fixating on higher-value targets than on lower-value ones. For example, under the UnValEqPre condition, the mean eye fixation duration is 344 milliseconds on targets valued at 16, compared with 309 milliseconds on targets valued at 2. This may reflect an enhancement of learning and memory, whereby longer fixation durations facilitate cognitive processing and reinforce associations between previous decision-making strategies and positive outcomes.

Refer to caption
Figure S3: Eye fixation duration for different types of targets in UnValEqPre (T1: mean = 309 ms, T2: mean = 336 ms, T3: mean = 353 ms, T4: mean = 344 ms), UnValUnPre (T1: mean = 302 ms, T2: mean = 339 ms, T3: mean = 325 ms, T4: mean = 342 ms), and EqValEqPre (T1: mean = 342 ms, T2: mean = 346 ms, T3: mean = 339 ms, T4: mean = 347 ms). Fixation durations differ across targets with different values in the UnValEqPre (p = 0.13) and UnValUnPre (p = 0.12) conditions, and do not differ across targets with the same value in EqValEqPre (p = 0.98).

S3.3 Average reward within fixation area

Refer to caption
Figure S4: Mean rewards of all target objects within a radius of 1.5 degrees of visual angle around each fixation, as predicted by our VF model (UnValEqPre: mean = 8.08, UnValUnPre: mean = 3.60, EqValEqPre: mean = 3.00), made by human subjects (UnValEqPre: mean = 3.30, UnValUnPre: mean = 1.29, EqValEqPre: mean = 1.28), and predicted by the chance model (UnValEqPre: mean = 2.75, UnValUnPre: mean = 0.98, EqValEqPre: mean = 0.84). In all three conditions, both human subjects and our VF model tend to fixate on regions with average rewards significantly higher than those derived from random fixations (two-tailed t-tests; all p-values below 0.01).

S3.4 Click behavior

Refer to caption
Figure S5: Proportion as a function of the number of clicks for (A) humans in UnValEqPre, (B) humans in UnValUnPre, (C) the VF model in UnValEqPre, and (D) the VF model in UnValUnPre. Solid lines show the click proportions for different types of targets; dashed lines show the proportions of the different targets remaining on screen. Colors indicate target types.

S4 Future work

First, we observed that humans occasionally clicked on items they were not directly fixating on, while VF assumes eye movements always align with the locations at which foraging decisions are made. Second, a strong priming effect was evident in humans, especially when target values were equal, showing the long-lasting influence of prior experiences on human decisions. Our VF currently lacks the ability to model such long-term dependencies, as it does not have a working memory integrating reinforcements from past actions into current decisions. Third, in hybrid foraging, humans actively compare fixated items with those in memory, a process known as memory search. Our VF assumes perfect memory search, where all targets are compared to the fixated item simultaneously. Lastly, real-world environments may present additional challenges, such as target occlusions and physical constraints imposed by scene contexts. Extending the study of hybrid visual foraging beyond simplistic stimuli in controlled experimental settings remains an intriguing research direction.