Visual Embedding of Screen Sequences for User-Flow Search in Example-driven Communication
Abstract.
Effective communication of UX considerations to stakeholders (e.g., designers and developers) is a critical challenge for UX practitioners. To explore this problem, we interviewed four UX practitioners about their communication challenges and strategies. Our study identifies that providing an example user flow—a screen sequence representing a semantic task—as evidence reinforces communication, yet finding relevant examples remains challenging. To address this, we propose a method to systematically retrieve user flows using semantic embedding. Specifically, we design a model that learns to associate screens’ visual features with user flow descriptions through contrastive learning. A survey confirms that our approach retrieves user flows better aligned with human perceptions of relevance. We analyze the results and discuss implications for the computational representation of user flows.

Figure 1. Example retrieval results from our embedding model. The figure shows the two input methods—a screen sequence and the text query "Add batteries to a cart"—and their relevant retrieval results from the user flow database based on semantic similarity, presented as three user flow examples with their similarity ranks.
1. Introduction
Effective communication of user experience (UX) considerations to stakeholders is critical to user-centric product design (Shukla et al., 2024). UX practice is a collaborative process involving stakeholders with diverse professions (e.g., product management, design, development) (Feng et al., 2023). Accordingly, communicating the importance of UX considerations to other stakeholders is a critical challenge (Nørgaard and Hornbæk, 2006; Gulliksen et al., 2006; Fan et al., 2020). The key characteristic of this challenge is maximizing stakeholder buy-in for user-aligned decisions by strengthening the argument (MacDonald et al., 2022). In response, previous research highlights communication strategies including rhetorical techniques (Dumas et al., 2004; Gray, 2016; Rose and Tenenberg, 2016), exposing stakeholders to a user (Følstad et al., 2012), and visual illustrations (Hornbæk and Frøkjær, 2005; Rose and Tenenberg, 2016; Gray, 2016; MacDonald et al., 2022) to demonstrate the significance of UX considerations.
Among these approaches, example-driven communication has shown effectiveness (Karlgren and Ramberg, 2012), allowing stakeholders to empirically observe the considerations (Herring et al., 2009; Kang et al., 2018). However, UX practitioners report difficulties searching for examples precisely aligned with their considerations (Wu et al., 2021a; Son et al., 2024). Prior research addresses this challenge with machine learning models that embed information from single-screen examples, leveraging view hierarchies, spatial layouts (Deka et al., 2017; Jiang et al., 2024; Huang et al., 2019), and screen text or graphics (Bai et al., 2021; Liu et al., 2018; Bunian et al., 2021) for semantic search. Although these approaches capture screen-level semantics, a single screen cannot illustrate the user flow, a screen sequence that represents a semantically meaningful task (Kang et al., 2018; Deka et al., 2016). User flow information is key to examples because it demonstrates design rationales through user interaction scenarios. Yet, only a few studies have addressed embedding user flows via screen sequences (Deka et al., 2016; Li et al., 2021b; He et al., 2021), as the problem requires the model to learn both the visual and temporal dimensions of the screens and associate them.
Building on these observations, we propose a method to systematically search user flow examples for evidence-driven communication. To ground our approach, we conducted a focus group with four UX practitioners to explore communication strategies. Our findings confirm that practitioners use concrete user flow examples to communicate UX considerations, especially during the planning phase. To support the systematic search of user flow examples, we propose a method for capturing user flow semantics from visual features in a screen sequence. Inspired by video classification research (Zhao et al., 2024; Arnab et al., 2021; Yuan et al., 2023), the model combines visual features using multi-head attention pooling and aligns them with user flow descriptions via contrastive learning on screen sequence–text pairs. Our model allows semantic search of user flows by extracting embeddings from an input screen sequence (Figure 1A) or text (Figure 1B) and retrieving the most relevant examples from a database using cosine similarity (Figure 1C). A user survey comparing our approach with retrieval based on text embeddings shows that our method aligns more closely with human perceptions of relevance.
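As a concrete illustration of this retrieval step, the sketch below (with hypothetical function and variable names, not part of our released code) ranks database user flows by cosine similarity to a query embedding, assuming all embeddings are precomputed and L2-normalized so that a dot product equals cosine similarity.

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, flow_embs: np.ndarray, k: int = 3):
    """Rank user flows in the database by cosine similarity to the query.

    query_emb: (d,) embedding of an input screen sequence or text query.
    flow_embs: (n, d) matrix of precomputed user-flow embeddings.
    Both are assumed L2-normalized, so the dot product equals cosine similarity.
    """
    sims = flow_embs @ query_emb           # (n,) cosine similarities
    top = np.argsort(-sims)[:k]            # indices of the k most similar flows
    return top, sims[top]
```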
In summary, this paper contributes (1) an in-depth interview study on how UX practitioners communicate UX considerations to stakeholders, (2) a novel embedding method for user flows that enables semantic search, and (3) a survey study that evaluates the effectiveness of this approach and provides implications for the computational representation of user flows.
2. Related Work
2.1. Example-driven Communication of UX Considerations
Communication of UX considerations to stakeholders is critical in collaboration for user-centric product design (Gray, 2016; Rose and Tenenberg, 2016), helping product managers and developers embrace user-centric decisions (Shukla et al., 2024; MacDonald et al., 2022). Effective communication addresses two major collaboration challenges: keeping pace with collaborators and aligning priority conflicts (MacDonald et al., 2022). It ensures the UX is sufficiently considered among collaborators in the fast-paced product development process (Holtzblatt and Holtzblatt, 2014). It also aligns UX considerations with other collaborators’ priorities (e.g., retention and conversion metrics) (Shukla et al., 2024). To support communication, previous research explored strategies including rhetorical techniques (Rose and Tenenberg, 2016; Dumas et al., 2004; Gray, 2016), exposing stakeholders to a user (Følstad et al., 2012), and visual illustrations (Hornbæk and Frøkjær, 2005; Rose and Tenenberg, 2016; Gray, 2016; MacDonald et al., 2022) to demonstrate the significance of UX considerations.
Example-driven communication is one of the effective communication strategies (Karlgren and Ramberg, 2012). In design practice, examples inspire practitioners by showing how content is structured and integrated (Lee et al., 2010; Swearngin et al., 2018; Choi et al., 2024). Similarly, when collaborating with stakeholders, examples strengthen communication by allowing empirical observation (Herring et al., 2009; Kang et al., 2018). They clarify the rationale behind considerations (Kang et al., 2018) and help establish a shared understanding of problems (Karlgren and Ramberg, 2012). While examples take various forms, a prominent theme is user flow, incorporating journey maps, video/audio reels, flows, and user stories (MacDonald et al., 2022). A user flow represents semantics that arise from an interaction sequence, such as a user task (Deka et al., 2016), which is more challenging to infer than the semantics of static screens. In response, this work addresses a computational method to help practitioners search user flow examples through semantic embedding.
2.2. Semantic Embedding of Screens
Semantic embedding of screens, which produces computational representations of screens, has received continuous attention for downstream tasks such as search (Huang et al., 2019), screen captioning (Wang et al., 2021), component prediction (Li et al., 2021b; Jiang et al., 2024), and code generation (Moran et al., 2018). A mainstream approach leverages view hierarchy and layout information as representative features (Deka et al., 2017), demonstrating embedding models’ ability to identify visually coherent examples (Deka et al., 2017; Huang et al., 2019). Studies further add semantic tags for each layout component (Li et al., 2021b; Bunian et al., 2021; Liu et al., 2018; Wu et al., 2021b) or encode screen text and images with pre-trained encoders (Wang et al., 2021; Ang and Lim, 2022; Li et al., 2021a; Baechler et al., 2024; Bai et al., 2021), effectively capturing domain-specific screen characteristics such as app category or feature (Wang et al., 2021; Ang and Lim, 2022; Li et al., 2021a; Bai et al., 2021). Notably, Wu et al. incorporated a temporal dimension in embeddings to capture screen animation (Wu et al., 2020).
Furthermore, studies have explored methods to embed user flows through screen sequences, which represent a meaningful task through step-by-step interactions (Deka et al., 2016). Embedding screen sequences expands computational capabilities including user flow search (Deka et al., 2016), screen relation inference (He et al., 2021; Li et al., 2021b), and interaction prediction (Rawles et al., 2024). The primary approach involves learning temporal context with order-sensitive models (He et al., 2021; Li et al., 2021b), and studies have explored large language models’ capability to comprehend user flows (Lu et al., 2024; Chen et al., 2024). Recently, Fu et al. proposed a JEPA-based design (Assran et al., 2023) that generates user flow video embeddings to represent user intent (Fu et al., 2024). Building on recent video understanding models (Zhao et al., 2024; Arnab et al., 2021; Yuan et al., 2023), we address this problem with order-sensitive multi-head attention pooling of visual features in screen sequences.
3. Formative Study
We conducted a formative study with UX practitioners to investigate the challenges of example-based communication. Specifically, we aimed to sharpen our understanding by exploring two key questions: (Q1) In what contexts does example-driven communication arise? and (Q2) What attributes of examples are most important to practitioners? To develop a generalizable understanding of the problem, we conducted a focus group interview (Wilkinson, 1998).
3.1. Study Design
We recruited four UX practitioners with professional experience through snowball sampling. The group consisted of three product designers (N = 3) and one UX researcher (N = 1), with an average of 1.27 years (SD = 7.8 months) of professional experience. To encourage detailed responses, the group interview was conducted in the interviewees’ primary language. Two researchers collaboratively analyzed the study data through thematic analysis (Saldana, 2021).
The study began with a five-minute introduction covering the study protocol and consent. The main interview was divided into two sub-sessions, each lasting around 40 minutes. In the first sub-session, we focused on (Q1) identifying the context of example-driven communication. We asked questions about interviewees’ current collaboration patterns, existing communication challenges, and strategies they used to address these challenges. In the second sub-session, we focused on (Q2) understanding the important attributes of examples. We introduced a hypothetical scenario of an AI support tool that provided examples and asked interviewees to evaluate them. The researchers ended the interview by summarizing the discussion and asking for additional feedback.
3.2. Findings
3.2.1. Moderating Conflict Through Evidence
Participants noted that moderating conflict within the organization is a key communication challenge. With constantly shifting collaborators (P1, P2, P3), participants reported experiencing various conflicts over UX considerations. The major source of conflict was each team’s different priorities. These priorities influenced the UX considerations addressed (P2, P3) and their relative importance in the schedule (P3, P4). The conflict over priorities stems from different estimations of user experience impact; P4 stated “For instance, while some consider increasing the purchase conversion rate most important, others might view app inflow rate as more critical, emphasizing the overall main UI as more important.”
In response, all participants emphasized the importance of evidence in moderating the conflict. Evidence helps persuade stakeholders regarding UX considerations (P1, P3, P4) and accelerates decision-making (P3). User research data, examples, design artifacts, professional experience, and academic research are major types of evidence. Participants also combined different evidence types to strengthen communication, e.g., comparing user data between their product and the example (P1) or matching user feedback with example cases (P3).
3.2.2. Characteristics of Examples as Evidence
Participants commonly utilized examples as evidence during the planning phase to address uncertainty. In the planning phase, the lack of user data and artifacts introduced uncertainty to communication (P1, P3, P4). To address the problem, participants searched for existing examples as evidence (P3, P4). Upon searching, they collected examples directly related to the product domain, such as examples from main competitors (P2, P3, P4). They looked for relatable aspects from examples to the current product; P2 noted, “By looking at the examples from competitors, we can see how their approach applies to our problem and identify the relevant issues or pain points.”
In this context, participants viewed examples as more abstract than a component or screen, relating to a user flow. Participants described examples in words such as product (P4), pattern (P1), and feature (P2, P3). Correspondingly, they used modalities that effectively represent user flow, including interaction logs (P1), screen recordings (P2), and flow maps (P3) to communicate the examples to stakeholders. These examples allowed stakeholders to make empirical observations on the UX considerations and explore solution scenarios (P1, P2, P3).
4. Methods

Figure 2. Overview of the model. A sequence of mobile app screens is processed by a Vision Transformer (ViT) to produce patch-based visual embeddings, which are then aggregated with a single-layer multi-head attention mechanism using temporal encoding and masking. In parallel, the text instruction (“Go to Amazon search bar and search Apple AirPods pro”) is transformed by a text embedding model. Finally, a contrastive loss aligns the aggregated screen-sequence representation with its corresponding text embedding, allowing the model to learn how screen sequences and text instructions match.
In this section, we present our approach for embedding screen sequences to represent user flows. A user flow is a screen sequence that represents a semantic task (Kang et al., 2018; Deka et al., 2016). Drawing inspiration from recent video classification models (Zhao et al., 2024; Arnab et al., 2021; Yuan et al., 2023), we design a model that pools visual features from a screen sequence and aligns them with the user flow description using contrastive learning.
4.1. Model Architecture
To enable semantic embedding of screen sequences, we design an encoder–pooler head structure outlined by Yuan et al. (Yuan et al., 2023), in which the pooler aggregates information from the encoder. The pooler head is a single cross-attention layer in which a single learnable query token attends to meaningful visual features across the sequence. Our model thereby captures the key visual features relevant to the user flow semantics of the screen sequence. Additionally, we add padding and masking in the pooler to allow the embedding of variable-length screen sequences.
Visual Encoder. (Figure 2A) To capture visual features from each screen, we employ DINOv2 (ViT-L/14) with registers (Darcet et al., 2023; Oquab et al., 2023) as a frozen image encoder that takes 224×224 screen images and outputs 1024-dimensional feature embeddings. Although end-to-end fine-tuning improves performance (Yuan et al., 2023), we assume that the ViT can capture screen semantics without fine-tuning, based on an observation with CLIP (Radford et al., 2021) by Park et al. (Park et al., 2023).
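A minimal sketch of this frozen feature-extraction step is given below, assuming the publicly released DINOv2 hub entry point and standard ImageNet normalization; preprocessing details beyond the 224×224 resize are assumptions on our part.

```python
import torch
from torchvision import transforms

# Load DINOv2 ViT-L/14 with registers from the public release and freeze it.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg")
encoder.eval().requires_grad_(False)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # non-uniform resize to 224x224
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def encode_screens(screens):
    """screens: list of PIL images for one episode -> (T, 1024) frozen features."""
    batch = torch.stack([preprocess(img) for img in screens])
    return encoder(batch)  # forward returns the global (CLS) embedding per screen
```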
Pooler Head. (Figure 2B) After the visual encoder extracts per-screen embeddings, we feed each sequence of embeddings into a learnable pooler head that produces a single vector representation. To handle variable-length sequences, we pad shorter embedding sequences to a uniform length and use masking so that the pooler attends only to valid frames. We base our pooler head on a single-layer multi-head attention pooling mechanism (Yuan et al., 2023). The model first linearly projects each 1024-dimensional screen embedding into a 256-dimensional space and adds a learnable positional embedding to reflect the temporal order. Next, a single query token attends to all positions in the sequence via multi-head attention (MHA). The input and aggregated output are normalized, passed through a small multi-layer perceptron (MLP), and finally projected to 1536 dimensions to match the size of the text embeddings for user flow descriptions. This single-query design allows the pooler head to capture the underlying spatio-temporal relationships between screens in a flow.
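The following PyTorch sketch reflects our reading of this pooler head; hyperparameters not stated above (e.g., the number of attention heads and the MLP width) are assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class AttentionPoolerHead(nn.Module):
    """Single-layer multi-head attention pooling over per-screen embeddings."""

    def __init__(self, in_dim=1024, hidden=256, out_dim=1536, heads=4, max_len=6):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)                   # 1024 -> 256
        self.pos = nn.Parameter(torch.zeros(max_len, hidden))   # learnable positional embedding
        self.query = nn.Parameter(torch.zeros(1, 1, hidden))    # single learnable query token
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm_in = nn.LayerNorm(hidden)
        self.norm_out = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))    # project to text-embedding size

    def forward(self, x, pad_mask):
        """x: (B, T, 1024) padded screen features; pad_mask: (B, T), True at padded frames."""
        h = self.norm_in(self.proj(x) + self.pos[: x.size(1)])     # project + temporal encoding
        q = self.query.expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h, key_padding_mask=pad_mask)  # (B, 1, hidden)
        return self.mlp(self.norm_out(pooled.squeeze(1)))          # (B, 1536)
```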
4.2. Dataset
We train our model using data from Android in the Wild (AitW) (Rawles et al., 2024), which provides extensive sequences of real-world device interactions. In particular, we focus on a subset of the “SINGLE” partition, which contains single user flows. While keeping the variable length of each sequence, we filter sequences to those with 3–6 screens each, resulting in 12,659 final sequences (episodes) and a total of 49,590 screens. Our motivation is to isolate a moderate range of screen lengths often encountered in single user flows (e.g., a few steps to add an item to a cart) and reduce variability within the dataset. We did not include the visual interaction traces between screens in the images. Each screen image is resized to 224×224 pixels during preprocessing through non-uniform scaling to match the input constraints of DINOv2 (Oquab et al., 2023). Additionally, we used OpenAI’s text-embedding-3-small (OpenAI, 2024) to generate 1536-dimensional embeddings for each sequence’s task description provided by AitW.
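A sketch of this preprocessing pipeline is shown below; the episode dictionary layout is hypothetical (the AitW release stores episodes in a different on-disk format), while the OpenAI call uses the public embeddings API.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def prepare_episode(episode):
    """episode: {"screens": [PIL images], "instruction": str} -- hypothetical layout.

    Keeps only episodes with 3-6 screens, resizes each screen to 224x224
    (non-uniform scaling), and embeds the task description with
    text-embedding-3-small into a 1536-dimensional vector.
    """
    screens = episode["screens"]
    if not 3 <= len(screens) <= 6:
        return None  # filtered out of the training set
    screens = [img.resize((224, 224)) for img in screens]
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=episode["instruction"])
    return screens, resp.data[0].embedding  # list of 1536 floats
```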
4.3. Training Configurations
Our pooler head is trained to align screen-sequence embeddings with text embeddings through a contrastive learning objective similar to CLIP (Figure 2C) (Radford et al., 2021). Specifically, we adopt a method suggested by Zhao et al. (Zhao et al., 2024) in which the model minimizes a symmetric cross-entropy loss over the similarity scores of all image sequence-text pairs in a mini-batch:
(1) \[ \mathcal{L} \;=\; -\frac{1}{2N}\sum_{i=1}^{N}\left[ \log\frac{\exp\!\big((VT^{\top})_{ii}/\tau\big)}{\sum_{j=1}^{N}\exp\!\big((VT^{\top})_{ij}/\tau\big)} \;+\; \log\frac{\exp\!\big((VT^{\top})_{ii}/\tau\big)}{\sum_{j=1}^{N}\exp\!\big((VT^{\top})_{ji}/\tau\big)} \right] \]
where \(V\) and \(T\) denote mini-batch matrices of normalized screen-sequence and text embeddings, \(N\) is the mini-batch size, and \(\tau\) is a fixed temperature. For training, we randomly divide our dataset into training (90%) and validation (10%) splits and train the model for 100 epochs using the Adam optimizer at a learning rate of , with a batch size of 1024 episodes per iteration. During each gradient update, the pooler attends to the visual embeddings of the screens and produces a single output, which is then compared against the corresponding text embedding via the contrastive loss. Since the visual and text encoders remain frozen, the network converges quickly. We observed that the validation loss stabilized at approximately 100 epochs with a value of 2.616 (using PyTorch seed 123).
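The symmetric objective in Eq. (1) can be written compactly as below; this sketch covers the loss computation only, and the temperature value shown is an assumption, since Eq. (1) treats it as a fixed hyperparameter.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(v, t, tau=0.07):
    """v: (N, d) pooled screen-sequence embeddings; t: (N, d) text embeddings.

    Embeddings are L2-normalized, logits are scaled by a fixed temperature tau,
    and the loss averages the sequence-to-text and text-to-sequence
    cross-entropies over matching pairs, as in Eq. (1).
    """
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / tau                        # (N, N) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, labels)    # match each sequence to its text
    loss_t2v = F.cross_entropy(logits.T, labels)  # match each text to its sequence
    return 0.5 * (loss_v2t + loss_t2v)
```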
5. Evaluation
We conducted a user study to evaluate the performance of our model and investigate its practical applications in design processes. In the study, our objective was to address two key questions: (1) how our approach compares to the text description-based baseline and (2) how it could be integrated into real-world design practices. Additionally, we report a minor implementation error in the model; the impact on performance was minimal.
5.1. Study Design
We recruited 21 participants—14 with UX experience and 7 without—for an online survey that included 27 multiple-choice questions and 15 optional open-ended questions. The study consisted of two main tasks.
In Task 1, participants assessed the similarity between a source screen sequence and two candidate sequences, one retrieved by the baseline model and one by our model. We randomly sampled five source screen sequences. We measured similarity using a 5-point Likert scale across four dimensions (service similarity, screen type similarity, content similarity, and visual similarity). These dimensions reflect established practices for assessing screen-based relevance and layout coherence (Li et al., 2021b). To minimize potential order bias, we randomized the presentation order of the two candidate sequences. In Task 2, participants assessed design examples retrieved by our model in practical scenarios (e.g., “designing a search result page”) from a UX practitioner’s perspective. Based on established design applicability measures (Jeon et al., 2021), participants rated each candidate example on five aspects using a 5-point Likert scale: layout alignment, UI component utility, innovation potential, integration feasibility, and overall reference value.
For data analysis, we examined the responses of both tasks using different analytical approaches. For Task 1, we verified the normality of the data distribution using the Shapiro-Wilk test (), confirming normality, and subsequently conducted paired t-tests to statistically compare our approach with the baseline. For Task 2, we employed a descriptive analysis of the quantitative Likert-scale response.
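For reference, the Task 1 analysis can be reproduced with standard SciPy routines, as sketched below; variable names are hypothetical placeholders for the per-participant mean similarity ratings.

```python
from scipy import stats

def compare_models(ours, baseline, alpha=0.05):
    """ours, baseline: per-participant mean similarity ratings (Task 1)."""
    diff = [o - b for o, b in zip(ours, baseline)]
    _, p_norm = stats.shapiro(diff)            # Shapiro-Wilk normality check
    t, p = stats.ttest_rel(ours, baseline)     # paired t-test between conditions
    return {"shapiro_p": p_norm, "normal": p_norm > alpha, "t": t, "p": p}
```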
5.2. Findings
5.2.1. Effectiveness of Sequence-based Search

Figure 3. Box plots comparing similarity scores (on a 1–5 Likert scale) between the baseline model (lower scores) and our sequence-based search model (higher scores). Each box shows the median and quartiles; the difference between the two models’ scores is statistically significant.
As shown in Figure 3, our sequence-based search model demonstrated statistically significantly higher similarity scores than the baseline condition (). The high t-statistic (8.50) indicates not only statistical significance but also a substantial practical difference in performance. This improvement was observed consistently across participants with UX experience () and those without ().
The difference between our model and the baseline was modest in magnitude () but consistent. A detailed analysis revealed that participants with UX experience rated our model higher () than the baseline (). Similarly, participants without UX experience also preferred our model () over the baseline (). Notably, both groups showed comparable levels of improvement (UX experienced: , non-experienced: ), suggesting the model’s effectiveness generalizes across different levels of UX expertise.
5.2.2. Practical Value as UX Design Reference
The model’s value as a design reference was evaluated across five key dimensions, with all aspects scoring above 3.50 on average (on a 5-point scale). Participants positively assessed the system’s overall reference value for design work (). One UX designer particularly emphasized the benefit of viewing consistent UI sequences together, noting, “it’s nice to see the sequence of consistent UI because, if I search them separately, that could create inconsistencies in my UI flow.”
Participants rated the system highly on ease of integration into their current design processes (). Our model also received favorable evaluations regarding the practicality of the UI component () and the applicability of the layout and user flow (). Correlation analysis indicated a strong relationship between UI component practicality and layout applicability (). Several participants highlighted the integrated value of layouts and UI components, as reflected in comments like, “I think the structural and button designs could be good references for UI development.”
5.2.3. Design Convention and Innovation
The innovative design solutions aspect received relatively lower ratings () and showed weaker correlations with other evaluation aspects (). Participants’ responses indicated a preference for conventional UI practices. One UX practitioner noted that “these are common UI practices,” highlighting the value of capturing established UI conventions. Another participant commented, “Why does the UI need to be newly designed or innovative in the first place? Doesn’t that make the user flow more unfamiliar?”
6. Discussion
6.1. Reinforcing User Flow Semantics with Visual Information
The evaluation results suggest that visual information can effectively reinforce user flow semantics. Our model consistently achieved statistically higher ratings, indicating a stronger alignment with human perception of similarity. Notably, despite the statistical significance and consistent gap, the average rating difference between our model and the baseline was relatively moderate. This indicates that the visual features captured by our model consistently add to the information available in the textual description when predicting the semantic similarity between two sequences. Specifically, we observed that screen sequences retrieved by our model shared a series of visual features with the source screen sequence, and these features add precision to the task description (e.g., the keyword “search” retrieved screen sequences that contain interactions with a search bar). Considering the model’s simple architecture, its capability to link textual semantics to a specific sequence of visual features is promising.
6.2. Design Implications for Sequence-based UI Reference Systems
Based on our formative study, our sequence-based approach demonstrates the potential to enhance example-driven communication in UX by providing rich user journeys that extend beyond isolated examples (Karlgren and Ramberg, 2012). Building on this, features that connect sequence examples with journey maps or task flows could enhance contextual understanding (Karlgren and Ramberg, 2012; MacDonald et al., 2022). Additionally, improving system explainability would boost trustworthiness, given the evidence-based nature of UX communication (Herring et al., 2009; Kang et al., 2018). Furthermore, for the effective utilization of retrieved sequences, careful integration with existing design tools such as Figma or Sketch is recommended (Figma, 2025; Sketch, 2025).
Another notable finding concerns the tension between convention and innovation in expectations for UX example search. Our sequence-based search is well suited to identifying established UI patterns, and the user preference for familiar UI conventions reflects practical reference needs in professional settings. However, to support a broader range of practical scenarios, future systems could offer additional search options that encourage innovative design exploration, for example by ranking results with diversity metrics rather than similarity alone.
6.3. Limitations and Future Work
Despite these promising results, considerable uncertainties regarding the model’s performance and applicability remain. Our model is trained on a narrow subset of the AitW dataset (Rawles et al., 2024) with domain-specific examples; hence, its ability to handle diverse user flow types remains unknown. The model’s understanding of textual descriptions is limited to the label distribution in the training data, thereby constraining the selection of search queries. Additionally, the current architecture uses a relatively simple attention-pooling method with a pre-trained ViT and text encoder, focusing primarily on evaluating the model’s capabilities. To advance this approach, future iterations could incorporate screen text (Wang et al., 2021) and user interactions between screens into model training. Incorporating model design strategies such as variable-resolution input (Lee et al., 2023), low-rank adapters (Yuan et al., 2023), and masked autoencoders (Tong et al., 2022; Zhao et al., 2024) could further enhance performance. Integrating language decoders (Li and Li, 2022; Baechler et al., 2024) could extend functionality beyond search to tasks such as interaction localization, question answering, and action prediction with relevant benchmark evaluations. Similarly, a comparative evaluation against benchmarks with existing video foundation models (Zhao et al., 2024; Wang et al., 2022) and previous screen sequence embedding approaches (Deka et al., 2016; Li et al., 2021b; Lu et al., 2024) could highlight the relative benefit of our approach. Furthermore, an evaluation with a more diverse set of UX practitioners could expand the exploration towards more specific design scenarios in real-world applications.
7. Conclusion
We explored an approach to enhance UX practitioners’ ability to communicate design considerations through sequence-based example search. Our formative study showed that practitioners use user flow examples to communicate UX considerations. We developed and evaluated an AI model for embedding screen sequences to extract user flow semantics, which improved human alignment compared to a text-based retrieval baseline. A survey involving 21 participants with varying UX backgrounds revealed that our approach outperformed the baseline in retrieving relevant user flow examples and showed potential for integration with design practices. This research advances the understanding of user flows by exploring their role in communication and their computational representation.
Acknowledgements.
We are deeply grateful to our participants and reviewers who significantly contributed to this work. We especially thank our teammates, Dongyu and Woohyun, for their collaboration throughout the HCI course. We also appreciate Jaesang, Hyoungwook, Daehyun, Yeonsu, and Bekzat for providing valuable feedback on our drafts.

References
- Ang and Lim (2022) Gary Ang and Ee-Peng Lim. 2022. Learning Semantically Rich Network-based Multi-modal Mobile User Interface Embeddings. ACM Transactions on Interactive Intelligent Systems 12, 4 (2022), 1–29.
- Arnab et al. (2021) Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 6836–6846.
- Assran et al. (2023) Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. 2023. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15619–15629.
- Baechler et al. (2024) Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615 (2024).
- Bai et al. (2021) Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, et al. 2021. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731 (2021).
- Bunian et al. (2021) Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, and Magy Seif El-Nasr. 2021. Vins: Visual search for mobile user interface design. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14.
- Chen et al. (2024) Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. 2024. GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents. arXiv preprint arXiv:2406.10819 (2024).
- Choi et al. (2024) DaEun Choi, Sumin Hong, Jeongeon Park, John Joon Young Chung, and Juho Kim. 2024. CreativeConnect: Supporting Reference Recombination for Graphic Design Ideation with Generative AI. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 1055, 25 pages. doi:10.1145/3613904.3642794
- Darcet et al. (2023) Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2023. Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023).
- Deka et al. (2017) Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology. 845–854.
- Deka et al. (2016) Biplab Deka, Zifeng Huang, and Ranjitha Kumar. 2016. ERICA: Interaction mining mobile apps. In Proceedings of the 29th annual symposium on user interface software and technology. 767–776.
- Dumas et al. (2004) Joseph S Dumas, Rolf Molich, and Robin Jeffries. 2004. Describing Usability Problems: Are we sending the right message? interactions 11, 4 (2004), 24–29.
- Fan et al. (2020) Mingming Fan, Serina Shi, and Khai N Truong. 2020. Practices and Challenges of Using Think-Aloud Protocols in Industry: An International Survey. Journal of Usability Studies 15, 2 (2020).
- Feng et al. (2023) KJ Kevin Feng, Tony W Li, and Amy X Zhang. 2023. Understanding collaborative practices and tools of professional UX practitioners in software organizations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–20.
- Figma (2025) Figma. 2025. Figma: Collaborative Interface Design Tool. https://www.figma.com. Accessed: 2025-01-23.
- Følstad et al. (2012) Asbjørn Følstad, Effie Law, and Kasper Hornbæk. 2012. Analysis in practical usability evaluation: a survey study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas, USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, 2127–2136. doi:10.1145/2207676.2208365
- Fu et al. (2024) Yicheng Fu, Raviteja Anantha, Prabal Vashisht, Jianpeng Cheng, and Etai Littwin. 2024. UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity. arXiv preprint arXiv:2409.04081 (2024).
- Gray (2016) Colin M Gray. 2016. ”It’s More of a Mindset Than a Method” UX Practitioners’ Conception of Design Methods. In Proceedings of the 2016 CHI conference on human factors in computing systems. 4044–4055.
- Gulliksen et al. (2006) Jan Gulliksen, Inger Boivie, and Bengt Göransson. 2006. Usability professionals—current practices and future development. Interacting with computers 18, 4 (2006), 568–600.
- He et al. (2021) Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, and Jindong Chen. 2021. Actionbert: Leveraging user actions for semantic understanding of user interfaces. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 5931–5938.
- Herring et al. (2009) Scarlett R Herring, Chia-Chen Chang, Jesse Krantzler, and Brian P Bailey. 2009. Getting inspired! Understanding how and why examples are used in creative design practice. In Proceedings of the SIGCHI conference on human factors in computing systems. 87–96.
- Holtzblatt and Holtzblatt (2014) Karen Holtzblatt and Shoshana Holtzblatt. 2014. Communicating user research in order to drive design and product decisions. In CHI’14 Extended Abstracts on Human Factors in Computing Systems. 1155–1158.
- Hornbæk and Frøkjær (2005) Kasper Hornbæk and Erik Frøkjær. 2005. Comparing usability problems and redesign proposals as input to practical systems development. In Proceedings of the SIGCHI conference on Human factors in computing systems. 391–400.
- Huang et al. (2019) Forrest Huang, John F Canny, and Jeffrey Nichols. 2019. Swire: Sketch-based user interface retrieval. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–10.
- Jeon et al. (2021) Youngseung Jeon, Seungwan Jin, Patrick C. Shih, and Kyungsik Han. 2021. FashionQ: An AI-Driven Creativity Support Tool for Facilitating Ideation in Fashion Design. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 576, 18 pages. doi:10.1145/3411764.3445093
- Jiang et al. (2024) Yue Jiang, Changkong Zhou, Vikas Garg, and Antti Oulasvirta. 2024. Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–18.
- Kang et al. (2018) Hyeonsu B Kang, Gabriel Amoako, Neil Sengupta, and Steven P Dow. 2018. Paragon: An online gallery for enhancing design feedback with visual examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–13.
- Karlgren and Ramberg (2012) Klas Karlgren and Robert Ramberg. 2012. The use of design patterns in overcoming misunderstandings in collaborative interaction design. CoDesign 8, 4 (2012), 231–246.
- Lee et al. (2010) Brian Lee, Savil Srivastava, Ranjitha Kumar, Ronen Brafman, and Scott R Klemmer. 2010. Designing with interactive example galleries. In Proceedings of the SIGCHI conference on human factors in computing systems. 2257–2266.
- Lee et al. (2023) Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning. PMLR, 18893–18912.
- Li and Li (2022) Gang Li and Yang Li. 2022. Spotlight: Mobile ui understanding using vision-language models with a focus. arXiv preprint arXiv:2209.14927 (2022).
- Li et al. (2021b) Toby Jia-Jun Li, Lindsay Popowski, Tom Mitchell, and Brad A Myers. 2021b. Screen2vec: Semantic embedding of gui screens and gui components. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15.
- Li et al. (2021a) Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. 2021a. Vut: Versatile ui transformer for multi-modal multi-task user interface modeling. arXiv preprint arXiv:2112.05692 (2021).
- Liu et al. (2018) Thomas F Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. 2018. Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 569–579.
- Lu et al. (2024) Yuwen Lu, Ziang Tong, Qinyi Zhao, Yewon Oh, Bryan Wang, and Toby Jia-Jun Li. 2024. Flowy: Supporting UX Design Decisions Through AI-Driven Pattern Annotation in Multi-Screen User Flows. arXiv preprint arXiv:2406.16177 (2024).
- MacDonald et al. (2022) Craig M MacDonald, Emma J Rose, and Cynthia Putnam. 2022. How, why, and with whom do user experience (UX) practitioners communicate? Implications for HCI education. International Journal of Human–Computer Interaction 38, 15 (2022), 1422–1439.
- Moran et al. (2018) Kevin Moran, Carlos Bernal-Cárdenas, Michael Curcio, Richard Bonett, and Denys Poshyvanyk. 2018. Machine learning-based prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering 46, 2 (2018), 196–221.
- Nørgaard and Hornbæk (2006) Mie Nørgaard and Kasper Hornbæk. 2006. What do usability evaluators do in practice? An explorative study of think-aloud testing. In Proceedings of the 6th conference on Designing Interactive systems. 209–218.
- OpenAI (2024) OpenAI. 2024. text-embedding-3-small. https://platform.openai.com/docs/guides/embeddings. Accessed: 2024-11-21.
- Oquab et al. (2023) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023).
- Park et al. (2023) Seokhyeon Park, Wonjae Kim, Young-Ho Kim, and Jinwook Seo. 2023. Computational Approaches for App-to-App Retrieval and Design Consistency Check. arXiv preprint arXiv:2309.10328 (2023).
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
- Rawles et al. (2024) Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2024. AndroidInTheWild: A large-scale dataset for Android device control. Advances in Neural Information Processing Systems 36 (2024).
- Rose and Tenenberg (2016) Emma Rose and Josh Tenenberg. 2016. Arguing about design: A taxonomy of rhetorical strategies deployed by user experience practitioners. In Proceedings of the 34th ACM International Conference on the Design of Communication. 1–10.
- Saldana (2021) Johnny Saldana. 2021. The coding manual for qualitative researchers. SAGE publications Ltd.
- Shukla et al. (2024) Prakash Shukla, Suchismita Naik, Ike Obi, Phuong Bui, and Paul Parsons. 2024. Communication Challenges Reported by UX Designers on Social Media: An Analysis of Subreddit Discussions. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–6.
- Sketch (2025) Sketch. 2025. Sketch: Design, Prototype, Collaborate and Handoff. https://www.sketch.com. Accessed: 2025-01-23.
- Son et al. (2024) Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, and Juho Kim. 2024. GenQuery: Supporting Expressive Visual Search with Generative Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 180, 19 pages. doi:10.1145/3613904.3642847
- Swearngin et al. (2018) Amanda Swearngin, Mira Dontcheva, Wilmot Li, Joel Brandt, Morgan Dixon, and Amy J Ko. 2018. Rewire: Interface design assistance from examples. In Proceedings of the 2018 CHI conference on human factors in computing systems. 1–12.
- Tong et al. (2022) Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35 (2022), 10078–10093.
- Wang et al. (2021) Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2words: Automatic mobile UI summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology. 498–510.
- Wang et al. (2022) Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. 2022. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191 (2022).
- Wilkinson (1998) Sue Wilkinson. 1998. Focus group methodology: a review. International journal of social research methodology 1, 3 (1998), 181–203.
- Wu et al. (2021b) Jason Wu, Xiaoyi Zhang, Jeff Nichols, and Jeffrey P Bigham. 2021b. Screen parsing: Towards reverse engineering of ui models from screenshots. In The 34th Annual ACM Symposium on User Interface Software and Technology. 470–483.
- Wu et al. (2020) Ziming Wu, Yulun Jiang, Yiding Liu, and Xiaojuan Ma. 2020. Predicting and diagnosing user engagement with mobile ui animation via a data-driven approach. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–13.
- Wu et al. (2021a) Ziming Wu, Qianyao Xu, Yiding Liu, Zhenhui Peng, Yingqing Xu, and Xiaojuan Ma. 2021a. Exploring designers’ practice of online example management for supporting mobile ui design. In Proceedings of the 23rd International Conference on Mobile Human-Computer Interaction. 1–12.
- Yuan et al. (2023) Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, et al. 2023. Videoglue: Video general understanding evaluation of foundation models. arXiv preprint arXiv:2307.03166 (2023).
- Zhao et al. (2024) Long Zhao, Nitesh B Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, et al. 2024. Videoprism: A foundational visual encoder for video understanding. arXiv preprint arXiv:2402.13217 (2024).
Appendix A Survey Results for Similarity Evaluation in Task 1
Participant | UX Experience | Our Model Score | Baseline Model Score | Difference (Our - Baseline) |
A1 | Yes | 4.33 | 3.17 | 1.17 |
A2 | Yes | 3.50 | 3.00 | 0.50 |
A3 | Yes | 3.17 | 3.00 | 0.17 |
A4 | Yes | 4.33 | 3.17 | 1.17 |
A5 | No | 3.33 | 3.00 | 0.33 |
A6 | No | 3.00 | 2.50 | 0.50 |
A7 | Yes | 4.33 | 3.83 | 0.50 |
A8 | No | 4.50 | 3.67 | 0.83 |
A9 | Yes | 3.67 | 2.50 | 1.17 |
A10 | No | 3.50 | 2.83 | 0.67 |
A11 | No | 3.83 | 2.83 | 1.00 |
A12 | Yes | 3.33 | 3.33 | 0.00 |
A13 | No | 3.00 | 2.50 | 0.50 |
A14 | Yes | 4.00 | 3.50 | 0.50 |
A15 | Yes | 3.50 | 2.83 | 0.67 |
A16 | Yes | 3.33 | 3.17 | 0.17 |
A17 | Yes | 4.17 | 3.33 | 0.83 |
A18 | Yes | 4.33 | 3.50 | 0.83 |
A19 | Yes | 4.00 | 3.50 | 0.50 |
A20 | No | 3.17 | 2.83 | 0.33 |
A21 | Yes | 3.67 | 3.00 | 0.67 |
Appendix B Scenario Examples in Task 2
This section presents three example scenarios used in Task 2, along with the screen sequences retrieved by our model for each query. Figure 4 shows representative results retrieved for each scenario.
Scenario 1
- Scenario Description: You’re trying to design a search result page and want to find relevant examples.
- Text Query: Retrieve a pattern, “check a search result”

Scenario 2
- Scenario Description: You’re trying to design a product detail page and want to find relevant examples.
- Text Query: Retrieve a pattern, “scroll to a product detail”

Scenario 3
- Scenario Description: You’re trying to design a multi-tab interface and want to find relevant examples.
- Text Query: Retrieve a pattern, “changing between tabs”

Figure 4. Representative retrieval results for the three scenarios: search result pages listing various products (Scenario 1), detailed product pages for items such as watches and electronics (Scenario 2), and a multi-tab interface comparing search results and product details across different sites (Scenario 3).
Appendix C Survey Results for Design Process Integration Assessment in Task 2
Question (5-point Likert scale) | Mean | SD | Min | 25% | 50% | 75% | Max |
This screen’s layout and user flow would be applicable to my project | 3.63 | 1.11 | 1 | 3 | 4 | 5 | 5 |
The UI components would be valuable additions to my design | 3.68 | 1.09 | 1 | 3 | 4 | 4 | 5 |
This screen offers innovative design solution ideas that I could adapt for my project | 3.24 | 1.16 | 1 | 2 | 3 | 4 | 5 |
I could easily integrate elements from this screen into my current design process | 3.83 | 1.19 | 1 | 3 | 4 | 5 | 5 |
Overall, this screen serves as an excellent reference for my design work | 3.67 | 1.16 | 1 | 3 | 4 | 5 | 5 |