Evaluating Atypical Gaze Patterns through Vision Models: The Case of Cortical Visual Impairment
Abstract
A wide range of neurological and cognitive disorders exhibit distinct behavioral markers aside from their clinical manifestations. Cortical Visual Impairment (CVI) is a prime example of such conditions, resulting from damage to visual pathways in the brain and adversely impacting low- and high-level visual function. The characteristics impacted by CVI are primarily described qualitatively, challenging the establishment of an objective, evidence-based measure of CVI severity. To study those characteristics, we propose to create visual saliency maps by adequately prompting deep vision models with attributes of clinical interest. After extracting saliency maps for a curated set of stimuli, we use eye-tracking technology to evaluate the fixation traces of children with CVI against those maps. Our experiments reveal significant gaze markers that verify clinical knowledge and yield nuanced discriminability when compared to those of age-matched control subjects. Using deep learning to unveil atypical visual saliency is an important step toward establishing an eye-tracking signature for severe neurodevelopmental disorders, like CVI.
I Introduction
Human behavior presents a complex interplay influenced by social surroundings, the environment, and an individual’s cognitive or biomedical conditions. There is hence an increasing interest in studying behavioral markers that could indicate neurological diseases (e.g., Alzheimer’s disease [1]), mental illness (e.g., depression [2]), developmental disorders (e.g., Autism Spectrum Disorder (ASD) [3]) or affective patterns (e.g., mood [4], stress [5]). Aside from the profound implications for the quality and length of human life, such conditions incur a heavy financial burden for testing and treatment. The investigation of behavioral markers could thus assist specialists in both interpreting those conditions and diagnosing them at reduced cost. Behavioral Signal Processing (BSP) [6] and Artificial Intelligence (AI) technologies seek to quantify subtle nuances of behavioral signals that may serve as indicators of underlying cognitive or biomedical conditions. Various modalities have been researched to that end, the most prominent being speech and paralinguistic patterns [7], facial expressions [3], movement, and physiology, as quantified through specialized biosensors [5, 8].
In this paper, we apply this premise and use eye tracking technology to study Cortical Visual Impairment (CVI), one of the leading causes of pediatric visual impairment worldwide [9]. Yet, CVI lacks standardized, evidence-based methods that interpret its impacted visual characteristics. Eye tracking is a favorable behavioral modality, as it has previously been utilized to quantify atypical gaze patterns in response to predefined stimuli [10]. By scrutinizing gaze patterns, experts can gain a deeper understanding of how clinical populations perceive their environment [11]. Notably, people on the ASD spectrum show reduced attention to social stimuli such as the human face, voice and hand gestures [12, 13], whereas people with diagnosed depression show attentional biases toward emotional faces [14]. In our case, eye tracking could quantify the visual function of CVI patients on images depicting higher-order attributes such as social interaction and texture [15, 16]. However, assessing human visual saliency requires a model for efficient extraction of saliency maps for the stimuli used in diagnosis. This is typically done through expert annotations; however, these are time-consuming and prone to individual biases. Computational models [17, 18] have thus emerged to ease this process, yet they rarely capture high-level semantics.
In the following, we describe a machine learning-driven approach to create custom maps of visual saliency by prompting multimodal vision models with semantic and qualitative characteristics of CVI. We first outline a set of saliency markers related to CVI and an experimental setup to study them through eye-tracking measures. We then engineer suitable prompts for the attributes of interest and show that the derived saliency maps offer nuanced, clinically relevant insights.
II Cortical Visual Impairment
Cortical, or cerebral, visual impairment (CVI) is a neurological condition that occurs due to damage or injury to the visual pathways in the brain, resulting in a range of visual deficits [19]. Common causes of CVI include prematurity with periventricular leukomalacia, hypoxic-ischemic encephalopathy, trauma, hydrocephalus, and metabolic or genetic disorders [9]. CVI is the leading cause of pediatric visual impairment in developed countries, yet there is no standardized method of quantifying the impacted visual characteristics [20], and no evidence-based treatment option is available.
Despite the prevalence of CVI, the precise definition remains a topic of debate in the literature. Traditionally, the diagnosis of CVI required decreased visual acuity and/or visual field defects [21]. Most children diagnosed using this definition had profound visual impairment or blindness at diagnosis. More recently, the term CVI has been used more broadly to include children with deficits of higher-order visual function (such as recognition of abstract objects or faces), with or without intact visual acuity and visual fields. This has led to the consensus definition proposed by Sakki et al. [22] of CVI as “a verifiable visual dysfunction which cannot be attributed to disorders of the anterior visual pathways or any potentially co-occurring ocular impairment.”
Still, several unifying characteristics among children with CVI have been reported in the literature for decades, mainly based on qualitative descriptions. These include reduced contrast sensitivity [23] and spared color discrimination [24], selective sparing or limitation of motion processing [25], difficulties with visual crowding or complexity [15], specific deficits in recognizing abstract objects or faces (e.g. prosopagnosia) [16], difficulties with visually-guided orientation in space (topographic agnosia) [26], and variability in visual function based on individual and environmental factors [19]. Since most children with CVI have developmental delays that impair communication, identifying these deficits relies on assessment of visual behavior. Eye tracking is an attractive option to that end because, in addition to being objective and quantitative, the tracking protocols may be designed to assess a multitude of lower- and higher-order visual characteristics.

III Model-based Visual Saliency
Models of visual saliency traditionally include a set of handcrafted features extracted from the visual scene (images) through complex image processing techniques. Such features include color patterns or opponencies (e.g., red against green), edges or orientation patterns that have been shown to be relevant for human vision [27]. With the emergence of deep learning, there are more high-level attributes to account for, including shapes, depth and other human-inspired patterns. Neural network models like DeepGaze [17] have been trained on human fixations to predict an average image saliency map.
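To make the notion of a purely bottom-up, stimulus-driven saliency map concrete, the sketch below uses the classical spectral-residual method available in OpenCV's contrib saliency module; this is an illustrative stand-in, much simpler than DeepGaze, and the file name is a placeholder.

```python
# Minimal sketch: bottom-up (stimulus-driven) saliency with OpenCV's spectral-residual
# method. Illustrative only; not the DeepGaze model used in this work.
import cv2

image = cv2.imread("stimulus.jpg")                           # placeholder stimulus image (BGR)
detector = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, sal_map = detector.computeSaliency(image)                # float map in [0, 1], same size as image
if ok:
    sal_u8 = (sal_map * 255).astype("uint8")                 # rescale for visualization or analysis
```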
However, these bottom-up approaches are based solely on stimulus features. On the other hand, top-down human saliency takes into account human conditions, past experiences and intent to estimate fixation patterns [28]. Hence, CVI patients would be expected to diverge from baseline fixation estimations due to the associated visual impairment. Moreover, conditions like CVI require the extraction of saliency features that are challenging to extract with those methods, e.g., maps based on specific object categories, or based on non-material attributes like social interaction, motion, and complexity.
Multimodal models, particularly vision-language models, have been proposed to ground the visual modality in text descriptions. CLIP [29] is among the first efficient multimodal models and includes two unimodal encoders, one for visual and one for textual inputs. CLIP is trained by minimizing the latent distance between paired images and text descriptions while maximizing the distance between unpaired samples, using the InfoNCE [30] contrastive objective. Here we adopt SegCLIP [31], a recent variant of CLIP which is specifically trained for text-driven image segmentation. This task is particularly suited for our study since the model output is a saliency map that corresponds to the segmentation of the prompted attributes.
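For concreteness, the following is a minimal sketch of the symmetric InfoNCE objective used to train CLIP-style models, assuming a batch of already L2-normalized image and text embeddings (the function and variable names are ours):

```python
# Minimal sketch of the symmetric InfoNCE (contrastive) loss used by CLIP-style models.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (B, D) unit-norm embeddings of B paired image/text samples."""
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))           # matched pairs lie on the diagonal
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```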
Prompt Engineering SegCLIP incorporates a dual-encoder architecture containing a text and an image encoder. The authors propose three losses, including a reconstruction loss that supervises the generated segmentation output and a contrastive loss that aligns text and image inputs. We use the pre-trained model in inference mode and provide image inputs (our stimuli) along with free-form text descriptions of the attributes of interest (prompts). The model then aligns the two inputs and outputs a segmentation map that highlights the corresponding attribute in the image (hence, a saliency map). While SegCLIP is trained primarily on object-centric prompts, the authors show in [31] that the model can generalize to action attributes such as flying or driving. Here we investigate the extent to which it can generalize to qualitative attributes, such as complexity or social concepts, and we engineer prompts to better reflect the nuances of the CVI characteristics.
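As a rough illustration of how a text prompt can be turned into a coarse spatial saliency map with a CLIP-like dual encoder (not the SegCLIP pipeline itself, just a sketch assuming the Hugging Face openai/clip-vit-base-patch32 checkpoint; the helper name prompt_saliency is ours):

```python
# Minimal sketch: patch-to-prompt similarity as a coarse saliency map with a standard
# CLIP model. This approximates the idea of prompt-driven saliency, not SegCLIP itself.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_saliency(image: Image.Image, prompt: str) -> torch.Tensor:
    """Return a (side x side) grid of patch-to-prompt cosine similarities."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Patch-level visual tokens (drop the [CLS] token), projected into CLIP's joint space.
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patches = model.visual_projection(vision_out.last_hidden_state[:, 1:, :])   # (1, N, D)
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])  # (1, D)
    sim = F.cosine_similarity(patches, text_emb.unsqueeze(1), dim=-1)                # (1, N)
    side = int(sim.shape[1] ** 0.5)           # 7x7 grid for ViT-B/32 at 224x224
    return sim.reshape(side, side)

# Example: highlight regions matching a clinically motivated prompt.
# sal = prompt_saliency(Image.open("stimulus.jpg").convert("RGB"), "a human face")
```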

IV Experimental Setup
IV-A Study Design and Participants
We prospectively recruited 42 children between 12 months and 12 years of age with a diagnosis of CVI. The diagnosis was made during a routine eye exam by a pediatric neuro-ophthalmologist, based on reduced or worse-than-expected visual acuity given the severity of any ocular co-morbidity. Additionally, recruited children were required to have a known neurologic risk factor for CVI (e.g., prematurity with periventricular leukomalacia). We included children who had visual acuity sufficient to track images on a computer monitor and excluded those with photosensitive epilepsy, as well as those with any ocular cause of decreased vision except for mild atrophy. We also excluded children with abnormalities such as oculomotor apraxia that would prevent us from making inferences based on eye movements.
We also recruited 29 age-matched control subjects using a web-based recruitment service. Controls had no history of any neurologic or ophthalmologic condition, and underwent an eye exam to confirm normal visual acuity and ocular motility. The study was approved by the Children’s Hospital Los Angeles/University of Southern California Institutional Review Board (IRB) and adhered to the tenets of the Declaration of Helsinki and the US Health Insurance Portability and Accountability Act of 1996. Informed consent was obtained from the parent or legal guardian of all participants.
Eye tracking was performed using the SR Research EyeLink 1000 (Ottawa, Canada; https://www.sr-research.com/eyelink-1000-plus/) desktop remote eye tracker, recording at 500 Hz. Participants were seated 60 cm from a computer monitor, wearing their habitual spectacles. The recording session began with 3-point calibration and proceeded for 10 minutes while the participants watched a series of still images and videos interspersed with stimuli for psychophysical tests [32, 33]. In this paper we focus on still images, including pictures of landscapes, people, animals, toys, and food, that were selected to span the range of characteristics believed to be affected in CVI (e.g., color, contrast, complexity, orientation, and human interaction). Both realistic and cartoon images were included. Each image was shown for a total of 2 seconds.
IV-B Eye Tracking Signal Analysis
The eye-tracking software outputs gaze coordinates, pupil size, and sample-level annotations for detected fixations (i.e., periods during which gaze remains at a particular location), saccades (i.e., rapid eye movements between fixations), and eye blinks. For an eye movement to be considered a fixation, gaze has to remain stable (within 0.1 deg) for at least 100 ms. For an eye movement to be considered a saccade, it should have a velocity of at least 30 deg/s and an amplitude change of at least 0.1 deg. Here we focus on the gaze trajectory projected on stimulus image coordinates, which we partition into fixation and saccade segments. We discard fixations shorter than 50 ms [34] and further remove any fixations positioned partially (at least 20%) out of the stimulus image range. Lastly, we average the respective traces of the two eyes when both are available. To facilitate our statistical analysis of CVI markers, we consider the resulting trajectories per image trial and compute the average saliency of the fixation traces.
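A minimal sketch of this scoring step is shown below (our own illustrative code, not the exact study pipeline), assuming fixations are provided as (x, y, duration_ms) tuples in image pixel coordinates; for simplicity it drops fixations whose center falls outside the image rather than applying the 20% overlap criterion:

```python
# Minimal sketch: filter fixations and score them against a 2-D saliency map.
import numpy as np

def average_fixation_saliency(fixations, saliency_map, min_duration_ms=50):
    """Mean saliency at fixation locations, skipping short or out-of-image fixations."""
    h, w = saliency_map.shape
    values = []
    for x, y, dur in fixations:
        if dur < min_duration_ms:
            continue                      # discard very short fixations
        if not (0 <= x < w and 0 <= y < h):
            continue                      # simplified check: fixation center must be on the image
        values.append(saliency_map[int(y), int(x)])
    return float(np.mean(values)) if values else np.nan
```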
IV-C Statistical Analysis of Saliency
We evaluate the aforementioned visual characteristics of CVI by comparing the extracted saliency features across groups. The saliency maps are extracted from SegCLIP using free language prompts. For instance, depth was approached using prompts like “depth”, “background” versus “foreground”, or “far away” versus “front”. Similarly, complexity was induced with prompts such as “complex scene”, or “complexity” versus “smoothness”. The generated grayscale maps are then smoothed with a Gaussian filter and normalized to [0, 255]. We carry out a series of non-parametric Mann-Whitney tests to unveil significant differences between the two groups, accompanied by Cohen’s effect size and permutation testing (). We consider our results significant if .
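The group comparison can be sketched as follows (Mann-Whitney U test, Cohen's d, and a label-permutation test on the difference in mean fixation saliency); the permutation budget and variable names here are illustrative placeholders, not the study's exact settings:

```python
# Minimal sketch: non-parametric group comparison of per-trial fixation saliency.
import numpy as np
from scipy import stats

def compare_groups(ctrl, cvi, n_perm=10_000, rng=np.random.default_rng(0)):
    ctrl, cvi = np.asarray(ctrl, float), np.asarray(cvi, float)
    u, p_mw = stats.mannwhitneyu(ctrl, cvi, alternative="two-sided")
    # Cohen's d with the pooled standard deviation.
    n1, n2 = len(ctrl), len(cvi)
    pooled_sd = np.sqrt(((n1 - 1) * ctrl.var(ddof=1) + (n2 - 1) * cvi.var(ddof=1)) / (n1 + n2 - 2))
    d = (ctrl.mean() - cvi.mean()) / pooled_sd
    # Permutation test: shuffle group labels and recompute the mean difference.
    observed = ctrl.mean() - cvi.mean()
    pooled = np.concatenate([ctrl, cvi])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        count += abs(perm[:n1].mean() - perm[n1:].mean()) >= abs(observed)
    return {"U": u, "p_mannwhitney": p_mw, "cohens_d": d,
            "p_perm": (count + 1) / (n_perm + 1)}
```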
Differential Saliency In the previous examples, the use of versus indicates differential saliency between opposite attributes of interest. This is done by extracting the saliency maps using two separate prompts and then taking their difference (note that order matters) as the final map. We observed qualitatively that this technique enhances precision with respect to the attributes of interest and regularizes the effect of CVI patients producing significantly fewer fixations per trial (CTRL: , CVI: ), with those fixations scattered away from the image center (center bias – CTRL: , CVI: ).
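A minimal sketch of this differential map, reusing the hypothetical prompt_saliency helper sketched in Section III, subtracts the map of the "negative" prompt from that of the "positive" one, then smooths and rescales the result as described above:

```python
# Minimal sketch: differential saliency between two opposing prompts.
# `prompt_saliency` is the hypothetical helper sketched earlier, not part of SegCLIP's API.
import numpy as np
from scipy.ndimage import gaussian_filter

def differential_saliency(image, positive_prompt, negative_prompt, sigma=5.0):
    pos = np.asarray(prompt_saliency(image, positive_prompt), dtype=float)
    neg = np.asarray(prompt_saliency(image, negative_prompt), dtype=float)
    diff = gaussian_filter(pos - neg, sigma=sigma)        # positive minus negative: order matters
    # Normalize to [0, 255], matching the grayscale maps described above.
    return 255 * (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)

# Example: a complexity-versus-smoothness map for one stimulus image.
# d_map = differential_saliency(img, "complexity", "smoothness")
```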


V Results & Discussion
We first verify that CVI patients demonstrate atypical visual saliency compared to the matched controls. We use DeepGaze II [17] to create saliency maps that estimate typical human gaze patterns for our stimuli. Indeed, CVI patients show an average fixation saliency of 53, compared to 158 for controls (), which verifies the assumption of atypical gaze patterns in CVI. In the following, we analyze each of the prominent characteristics of CVI in terms of average fixation saliency measured across groups, and summarize the results in Figure 3.
Depth and Background For this experiment we extracted depth saliency maps by prompting for depth, with foreground information denoted by higher saliency. As shown in Figure 3a, patients with CVI focus significantly more on distant information in the image. Specifically, the computed average is 107 for the control group, compared to 140 for CVI (). This result is also verified using depth estimation maps extracted with MiDaS [18]. By plotting the corresponding fixation density (Figure 4) we observe that the depth bias stems primarily from the CVI group fixating less on foreground regions rather than fixating more on background regions. Despite this tendency, CVI subjects need more time to produce a valid background fixation (), while their overall latency to salient fixations does not differ ().
Orientation and Motion Consistent with the hypothesis, CVI subjects focus on still rather than moving objects in an image (), as shown by the differential saliency reported in Figure 3b. Regarding orientation information, SegCLIP outputs the most representative maps when prompted with bars or stripes, compared to more abstract prompts like orientation, direction, or directional patterns. This is expected, as the model performs better on object-centric prompts. As seen in Figure 3c, CVI patients fixate more on such patterns than the control group ().
Contrast and Complexity For this experiment we tested the assumption that CVI leads to functional challenges with scenes of high complexity, alongside a generally reduced contrast sensitivity [23]. As shown in Figure 3d, CVI subjects fixate significantly less on complex (positive saliency) rather than smooth (negative saliency) parts of the corresponding stimuli (). This symmetric difference is also evident in the fixation densities of the two groups (Figure 4). Similarly, testing on a subset of stimuli characterized by texture reveals that CVI subjects tend to avoid those regions, with an average saliency of 98 compared to 110 for controls ().
Faces and Social Stimuli CVI is thought to divert gaze away from social stimuli. For this experiment we isolate multiple groups of stimuli images corresponding to different concepts of human interaction: faces, hands, eyes, etc. In this case, SegCLIP was prompted with single-object descriptions to yield representative saliency maps; prompting with abstract concepts like human interaction, communication or emotion did not produce consistent maps for our stimuli. The results of these experiments are summarized in Figure 3e-f. CVI subjects fixate less than average on human faces (), and the latency of these fixations is longer (). No difference emerges for fixations on the eyes or mouth (). Interestingly though, as shown in Figure 3f, children with CVI tend to focus more on non-social human characteristics (). An additional within-group difference is that controls produce longer fixations at higher saliency (Pearson , ), whereas CVI subjects show uniform fixation durations ().
VI Conclusion
A novel, machine learning-based approach to create maps of visual saliency was proposed, focusing on the assessment of qualitative markers of eye-tracking activity in children with CVI. To that end, we leveraged the robustness of large vision models in segmenting image patches conditioned on textual descriptions of CVI characteristics. The availability of those saliency maps assisted us in verifying and reinforcing our understanding of a challenging neurological impairment with no standardized diagnostic methods. In the future, the proposed approach could be used in designing individualized interventions and assessing their efficacy in clinical trials.
References
- [1] Joshua R Ehrlich, Tochukwu Ndukwe, Sandy Chien, and Jinkook Lee. The association of cognitive and visual function in a nationally representative study of older adults in India. Neuroepidemiology, 2021.
- [2] Armen C Arevian, Daniel Bone, Nikolaos Malandrakis, Victor R Martinez, Kenneth B Wells, David J Miklowitz, and Shrikanth Narayanan. Clinical state tracking in serious mental illness through computational analysis of speech. PLoS one, 15(1):e0225695, 2020.
- [3] Tanaya Guha, Zhaojun Yang, Ruth Grossman, and Shrikanth Narayanan. A computational study of expressive facial dynamics in children with autism. IEEE Trans. on Affective Computing, 9(1):14–20, 2016.
- [4] Bradley M Appelhans and Linda J Luecken. Heart rate variability as an index of regulated emotional responding. Review of general psychology, 10(3):229–240, 2006.
- [5] Nis Hjortskov et al. The effect of mental stress on heart rate variability and blood pressure during computer work. European journal of applied physiology, 92:84–89, 2004.
- [6] Daniel Bone, Chi-Chun Lee, Theodora Chaspari, James Gibson, and Shrikanth Narayanan. Signal processing and machine learning for mental health research and clinical applications [perspectives]. IEEE Signal Processing Magazine, 34(5):196–195, 2017.
- [7] Shrikanth Narayanan and Panayiotis G Georgiou. Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5):1203–1233, 2013.
- [8] Adam C Frank, Ruibei Li, Bradley S Peterson, and Shrikanth S Narayanan. Wearable and mobile technologies for the evaluation and treatment of obsessive-compulsive disorder: Scoping review. JMIR Mental Health, 10:e45572, 2023.
- [9] Melinda Y Chang and Mark S Borchert. Advances in the evaluation and management of cortical/cerebral visual impairment in children. Survey of ophthalmology, 65(6):708–724, 2020.
- [10] Mélodie Vidal, Jayson Turner, Andreas Bulling, and Hans Gellersen. Wearable eye tracking for mental health monitoring. Computer Communications, 35(11):1306–1311, 2012.
- [11] Vasileios Skaramagkas, Giorgos Giannakakis, Emmanouil Ktistakis, Dimitris Manousos, Ioannis Karatzanis, Nikolaos S Tachos, Evanthia Tripoliti, Kostas Marias, Dimitrios I Fotiadis, and Manolis Tsiknakis. Review of eye tracking metrics involved in emotional and cognitive processes. IEEE Reviews in Biomedical Engineering, 16:260–277, 2021.
- [12] Noah J Sasson, Jed T Elison, Lauren M Turner-Brown, Gabriel S Dichter, and James W Bodfish. Brief report: Circumscribed attention in young children with autism. Journal of autism and developmental disorders, 41:242–247, 2011.
- [13] Shuo Wang, Ming Jiang, Xavier Morin Duchesne, Elizabeth A Laugeson, Daniel P Kennedy, Ralph Adolphs, and Qi Zhao. Atypical visual saliency in autism spectrum disorder quantified through model-based eye tracking. Neuron, 88(3):604–616, 2015.
- [14] Almudena Duque and Carmelo Vázquez. Double attention bias for positive and negative emotional faces in clinical depression: Evidence from an eye-tracking study. Journal of behavior therapy and experimental psychiatry, 46:107–114, 2015.
- [15] Claire E Manley, Christopher R Bennett, and Lotfi B Merabet. Assessing higher-order visual processing in cerebral visual impairment using naturalistic virtual-reality-based visual search tasks. Children, 2022.
- [16] Corinna M Bauer, Claire E Manley, John Ravenscroft, Howard Cabral, Daniel D Dilks, and Peter J Bex. Deficits in face recognition and consequent quality-of-life factors in individuals with cerebral visual impairment. Vision, 7(1):9, 2023.
- [17] Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge. DeepGaze II: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.
- [18] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
- [19] JE Jan, M Groenveld, AM Sykanda, and CS Hoyt. Behavioural characteristics of children with permanent cortical visual impairment. Developmental medicine & child neurology, 29(5):571–576, 1987.
- [20] Melinda Y Chang and Mark S Borchert. Methods of visual assessment in children with cortical visual impairment. Current Opinion in Neurology, 2021.
- [21] Sharon Whiting, James E Jan, Peter KH Wong, Olof Flodmark, Kevin Farrell, and Andrew Q McCormick. Permanent cortical visual impairment in children. Developmental Medicine & Child Neurology, 27(6):730–739, 1985.
- [22] Hanna EA Sakki, Naomi J Dale, Jenefer Sargent, Teresa Perez-Roche, and Richard Bowman. Is there consensus in defining childhood cerebral visual impairment? A systematic review of terminology and definitions. British Journal of Ophthalmology, 2017.
- [23] William V Good, Chuan Hou, and Anthony M Norcia. Spatial contrast sensitivity vision loss in children with cortical visual impairment. Investigative Ophthalmology & Visual Science, 53(12):7730–7734, 2012.
- [24] Stacey Ann Cohen-Maitre and Paul Haerich. Visual attention to movement and color in children with cortical visual impairment. Journal of Visual Impairment & Blindness, 99(7):389–402, 2005.
- [25] Zahide Pamir et al. Neural correlates associated with impaired global motion perception in cerebral visual impairment (CVI). NeuroImage: Clinical, 32:102821, 2021.
- [26] Gordon Dutton et al. Cortical visual dysfunction in children: a clinical study. Eye, 10(3):302–309, 1996.
- [27] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
- [28] James Tanner and Laurent Itti. A top-down saliency model with goal relevance. Journal of vision, 19(1):11–11, 2019.
- [29] Alec Radford et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
- [31] Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning, pages 23033–23044. PMLR, 2023.
- [32] Melinda Y Chang and Mark S Borchert. Comparison of eye tracking and Teller acuity cards for visual acuity assessment in pediatric cortical/cerebral visual impairment (CVI). American Journal of Ophthalmology, 2023.
- [33] Melinda Y Chang and Mark S Borchert. Validity and reliability of eye tracking for visual acuity assessment in children with cortical visual impairment. Journal of American Association for Pediatric Ophthalmology and Strabismus, 25(6):334–e1, 2021.
- [34] Bill Albert and Tom Tullis. Measuring the User Experience: Collecting, Analyzing, and Presenting UX Metrics. Morgan Kaufmann, 2022.