Learning Predictive Visuomotor Coordination
Abstract
Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.
1 Introduction
Analyzing egocentric video to predict what the camera-wearer is going to do next is an important task in egocentric vision with applications in Augmented Reality (AR) and robotics. In order for an AI assistant in a pair of smart eyeglasses to be helpful, it should be able to anticipate what a person might be trying to do and provide advice or interventions before a problem occurs. As a result, previous works have developed deep learning models to forecast ego-motion, gaze, and hand trajectories from egocentric videos [22, 28, 32]. However, most prior methods focus on isolated modality signals (e.g., only gaze or hand motion) and fail to model their multimodal interdependencies. A key missing element in these prior works is a detailed consideration of the visuomotor control system which underlies all goal-directed human movement.
Psychological research underscores the predictive nature of visuomotor coordination, particularly in how visual memory informs motor planning. Hayhoe et al. [18] observed that during natural tasks like making a sandwich, individuals rely on visual information from previous fixations to plan and coordinate movements over the course of several seconds. This reliance on stored visual memory allows humans to anticipate object locations and properties, ensuring seamless interaction with the environment. We hypothesize that understanding and modeling these predictive visuomotor behaviors can provide more effective forecasting capabilities for robotics, VR/AR, and human-computer interaction, where anticipating human movement can enhance real-time decision-making and system adaptability.
Beyond egocentric vision, visuomotor prediction has also been explored in robotics to enhance manipulation and assistive tasks. Learning from human demonstrations [12, 36, 51] has enabled robots to generalize across diverse actions. However, these methods often rely on task-specific datasets and do not model full-body visuomotor coordination in natural settings. Bridging the gap between egocentric perception and full-body visuomotor learning remains an open challenge.
We propose the first comprehensive predictive visuomotor learning framework, which jointly models head pose, gaze direction, and upper-body joint movements in 3D space. Unlike prior approaches that treat these signals independently, our model explicitly captures their temporal dependencies by encoding kinematic sequences and generating future motion trajectories using a diffusion model. By conditioning on multimodal observations, our method learns structured visuomotor patterns, enabling more temporally consistent and accurate movement predictions.
Our approach is enabled by the recent emergence of the EgoExo4D [15] and Nymeria [37] datasets, which contain comprehensive 3D annotations and make it possible to quantitatively evaluate visuomotor forecasting performance. Specifically, we evaluate our approach on EgoExo4D, leveraging its 3D annotations of gaze, head pose, and body motion. Averaged over a 1-second forecasting horizon, our model achieves an average error of 59 mm for visuomotor translation (head position, gaze ray endpoint, and upper-body joints) and a head rotation error of 13 degrees, demonstrating promising performance across a diverse set of daily activities. By jointly modeling these visuomotor signals, our method provides deeper insights into their temporal dynamics, advancing the study of predictive human motion modeling. Our contributions are summarized as follows:
- We formulate a forecasting-based task for human visuomotor modeling, integrating head pose, gaze direction, fixation, and upper-body motion to capture temporal dependencies.
- We extend a diffusion-based framework that integrates egocentric vision and kinematic sequences, leveraging multimodal fusion to enhance temporal coherence and prediction accuracy.
- We conduct extensive evaluations on the large-scale EgoExo4D dataset, demonstrating consistently strong performance across diverse real-world activities.
- Our analysis includes extensive quantitative and qualitative evaluations, with a comprehensive ablation study assessing the impact of each multimodal component on accuracy.
2 Related Work
2.1 Human Visuomotor Coordination
Human visuomotor coordination is fundamental to action planning and execution, integrating first-person perspective visual perception, head orientation, and proprioception to enable fluid, goal-directed movements. Unlike purely reactive control, it operates predictively, allowing for smooth and adaptive actions despite sensory and neural delays [41, 48, 45, 50, 3]. Research in neuroscience and motor control highlights that head movement, gaze, and body posture work together to anticipate motion trajectories [31, 30, 5]. Head orientation provides a stable spatial reference, aligning sensory input with motor execution, especially in dynamic environments [16]. Proprioception further refines this process, allowing the brain to track limb positions and adjust movements accordingly [44]. These multimodal cues are tightly coupled—humans naturally orient their head and upper body before executing reaching, stepping, or object manipulation tasks [24]. This predictive integration is essential for both action preparation and real-time adaptation [30, 44].
Inspired by these biological mechanisms, our work explicitly integrates head orientation, gaze, and egocentric perception into motion forecasting, treating them not as independent signals but as a structured visuomotor coordination representation, thereby bridging neuroscience insights with real-world human movement prediction.
2.2 Modeling Predictive Visuomotor Coordination
Existing human motion forecasting methods often focus on isolated sub-components rather than their integration. Gaze forecasting predicts future fixation points from past gaze trajectories, often using egocentric video [52, 28, 29], but it does not model how gaze directs motor actions. Hand trajectory forecasting predicts hand motion in first-person views, typically for object interactions [35, 22, 14], yet it largely ignores head movement and body posture, which naturally influence hand coordination. Full-body motion forecasting predicts skeletal motion based on past joint positions [13, 40, 42, 39, 19, 17, 8], but lower-body motion is often dictated by external terrain constraints rather than internal visuomotor coordination. In contrast, upper-body motion is directly linked to fine motor tasks, object interactions, and skill-based activities, where visuomotor coordination plays a critical role. Thus, we focus on upper-body dynamics, ensuring our approach remains independent of environmental priors while maintaining relevance across diverse skilled activities.
Egocentric vision has emerged as a key modality for studying human visuomotor coordination, as it provides a first-person perspective of perception and action. Unlike third-person views, egocentric video directly captures the relationship between visual attention, head movement, and motor execution, making it well-suited for modeling predictive visuomotor behaviors [52, 22, 32, 46]. This perspective is particularly valuable for understanding how humans coordinate gaze, head, and body movements in dynamic environments.
Another line of research incorporates environmental cues for motion prediction. Gaze-informed full-body forecasting has been explored in navigation tasks, using gaze to anticipate walking trajectories [27, 53, 49, 21]. However, these tasks are simpler, as locomotion follows scene constraints rather than requiring complex visuomotor coordination. Full-body forecasting in interactive activities integrates environmental affordances to predict human motion [53, 2], but such methods heavily rely on external cues like object locations or scene semantics, making them less generalizable to unstructured or unseen environments.
In contrast, our work addresses a more complex and generalizable problem by unifying head orientation, gaze, and upper-body motion as predictive signals for motion forecasting. By focusing primarily on internal visuomotor cues, our model learns a biologically grounded representation applicable across diverse activities, without explicitly relying on structured environmental priors.
2.3 Bridging Visuomotor and Imitation Learning
Recent advances in visuomotor imitation learning have enabled robots to acquire complex skills from human demonstrations [12, 51, 47, 36, 38, 54, 26, 6, 1, 23]. Works like EgoMimic [25] explore direct learning from human video data, removing the need for explicit robot data collection. However, modeling human visuomotor coordination remains a challenge, as it involves intricate couplings between gaze, head, and body movements that reflect hidden decision-making processes [34].
To simplify learning, many existing methods instruct demonstrators to behave “robot-like” by minimizing natural head and body movements [4, 33, 55, 43]. While effective for policy learning, this approach loses the richness of natural human behavior, making it difficult to generalize beyond constrained demonstrations.
Our work contributes to this direction by providing a predictive model of human visuomotor coordination, capturing the structured dependencies between head pose, gaze, and upper-body motion. Unlike imitation learning approaches that focus on end-to-end policy learning, our model predicts natural visuomotor behavior from in-the-wild datasets [15]. By explicitly modeling visuomotor coordination, our framework can serve as a foundation for future imitation learning research, providing data-driven insights into human movement dynamics that can improve robot learning from human demonstrations.

3 Method
We define the Visuomotor Coordination Representation (Sec. 3.1) and introduce the forecasting task (Sec. 3.2). To remove absolute head motion and ensure temporal and spatial consistency, we apply a state canonicalization process (Sec. 3.3). Finally, we present our diffusion-based generative framework for visuomotor learning (Sec. 3.4).
3.1 Human Visuomotor Representation
To effectively capture human visuomotor coordination, we consider three essential components: the head pose, which establishes a spatial reference frame for movement; the gaze point, which serves as an indicator of visual attention and intent; and the upper-body joints, which provide motion cues relevant to interaction and task execution. These components collectively define the Visuomotor Coordination Representation (VCR) by encapsulating the key factors that drive human visuomotor behavior.
We formally define a visuomotor state $S_t = \{H_t, G_t, J_t\}$, where $H_t = (p_t, R_t)$ represents the head pose with position $p_t \in \mathbb{R}^3$ and orientation $R_t \in SO(3)$. The gaze component is represented by the gaze endpoint $G_t$, defined as $G_t = p_t + d \cdot \hat{g}_t$, where $\hat{g}_t$ is the unit gaze direction derived from the gaze annotation, and $d$ is a predefined scalar controlling the gaze ray length. Finally, $J_t$ represents the upper-body joint positions, including shoulders, elbows, and wrists. Fig. 2 visualizes an example of VCR from Nymeria [37].
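To make the representation concrete, the following minimal sketch assembles a single VCR state from a head pose, a gaze direction, and upper-body joints. The array layout, the joint ordering, and the value used for the gaze ray length $d$ are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

GAZE_RAY_LENGTH = 1.0  # assumed value of the predefined scalar d (meters)

def build_vcr_state(head_pos, head_rot, gaze_dir, upper_body_joints):
    """Assemble one visuomotor state S_t = {H_t, G_t, J_t}.

    head_pos:          (3,)   head position p_t
    head_rot:          (3, 3) head orientation R_t as a rotation matrix
    gaze_dir:          (3,)   gaze direction (need not be unit length)
    upper_body_joints: (6, 3) shoulders, elbows, wrists (assumed ordering)
    """
    g_hat = gaze_dir / np.linalg.norm(gaze_dir)          # unit gaze direction
    gaze_endpoint = head_pos + GAZE_RAY_LENGTH * g_hat   # G_t = p_t + d * g_hat
    return {
        "head_pose": (head_pos, head_rot),  # H_t
        "gaze_endpoint": gaze_endpoint,     # G_t
        "joints": upper_body_joints,        # J_t
    }
```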
3.2 Predicting Visuomotor Coordination
Given a Visuomotor Coordination Representation sequence $S_{1:T_o} = \{S_1, \dots, S_{T_o}\}$ and its corresponding egocentric video clip $V_{1:T_o}$, our goal is to predict the future visuomotor states $\hat{S}_{T_o+1:T_o+T_p}$ over a prediction horizon of $T_p$ steps. This task requires joint modeling of the temporal evolution of human motion, including head pose, gaze, and upper-body joint dynamics, based on past visual and kinematic observations.

Formally, we define the Visuomotor Forecasting Task as learning a function

$$f_\theta : (S_{1:T_o}, V_{1:T_o}) \mapsto \hat{S}_{T_o+1:T_o+T_p},$$

where $f_\theta$ models the transition dynamics of visuomotor states given historical motion and egocentric perception, enabling anticipation of human visuomotor coordination in dynamic environments.
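For orientation, a stub of the forecasting interface implied by this definition is shown below. The batch layout, the flattened state dimension, and the default prediction length of 10 steps (matching the 1-second horizon at 10 fps used in the experiments) are assumptions for illustration.

```python
import torch
from torch import nn

class VisuomotorForecaster(nn.Module):
    """Interface stub for f_theta: (past VCR states, egocentric frames) -> future VCR states."""

    def __init__(self, state_dim: int, t_pred: int = 10):  # t_pred = 10 assumes a 1 s horizon at 10 fps
        super().__init__()
        self.state_dim, self.t_pred = state_dim, t_pred

    def forward(self, past_states: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # past_states: (B, T_obs, state_dim) flattened visuomotor states at 10 fps
        # frames:      (B, N, 3, H, W) egocentric RGB frames sampled at 4 fps
        # A real model fuses both modalities and decodes a trajectory; here we only
        # return zeros to make the expected output shape (B, t_pred, state_dim) explicit.
        return torch.zeros(past_states.shape[0], self.t_pred, self.state_dim)
```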

3.3 Visuomotor State Canonicalization
We canonicalize kinematic data to ensure consistency across time and spatial frames. To achieve this, we standardize all motion data by defining a stable reference frame based on the final observation. Specifically, we set the head pose in the last observed step $T_o$ to a canonical state, applying a transformation $T_{canon}$ that normalizes the head pose by aligning its orientation $R_{T_o}$ to the identity rotation and positioning it at the origin $(0, 0, 0)$. This transformation ensures consistency across time, allowing the model to learn visuomotor coordination independently of absolute head motion.
To preserve intra-state spatial consistency, the same transformation is applied to the gaze endpoint and upper-body joint positions through $\tilde{G}_{T_o} = T_{canon}(G_{T_o})$ and $\tilde{J}_{T_o} = T_{canon}(J_{T_o})$. This results in the canonicalized state $\tilde{S}_{T_o} = \{\tilde{H}_{T_o}, \tilde{G}_{T_o}, \tilde{J}_{T_o}\}$, where all visuomotor elements remain consistent relative to the canonical head frame, ensuring the model captures meaningful coordination patterns without being affected by head motion variations.
To preserve inter-state temporal consistency, we extend this operation to all preceding and future kinematic states. For kinematic states $S_t$ with $t \neq T_o$, each state is first expressed relative to the last observed head pose $H_{T_o}$ before being mapped to the canonical frame. As a result, we have $\tilde{S}_t = T_{canon}(S_t)$, where $\tilde{S}_{T_o}$ is the canonicalized reference state. This ensures that all frames remain correctly aligned relative to each other, preserving both temporal and spatial consistency.
This process removes absolute head motion while preserving the relative relationships between the head, gaze, and upper-body joints. It enables the model to learn visuomotor coordination independently of viewpoint variations, improving robustness and generalization. An example of this processing pipeline is shown in Fig. 3. Note that the canonicalized state $\tilde{S}$ is used in the experiments; however, for simplicity, we continue using the notation $S$ throughout this paper.
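A minimal sketch of this canonicalization is given below, assuming rotation matrices and row-vector 3D points; the paper's exact conventions (e.g., rotation parameterization) may differ.

```python
import numpy as np

def canonicalize_sequence(head_pos, head_rot, gaze_pts, joints):
    """Express a visuomotor sequence in the canonical frame of the last observed head pose.

    head_pos: (T, 3), head_rot: (T, 3, 3), gaze_pts: (T, 3), joints: (T, J, 3).
    """
    R_ref, p_ref = head_rot[-1], head_pos[-1]     # last observed head pose H_{T_o}
    to_canon = lambda x: (x - p_ref) @ R_ref      # equals R_ref^T (x - p_ref) for row vectors
    canon_head_pos = to_canon(head_pos)                                # origin at the last step
    canon_head_rot = np.einsum("ij,tjk->tik", R_ref.T, head_rot)       # identity at the last step
    canon_gaze = to_canon(gaze_pts)
    canon_joints = to_canon(joints.reshape(-1, 3)).reshape(joints.shape)
    return canon_head_pos, canon_head_rot, canon_gaze, canon_joints
```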
3.4 Model Architecture
We first describe how we extract and fuse multimodal information to construct the conditioning feature for diffusion (Sec. 3.4.1), followed by the diffusion-based visuomotor prediction process (Sec. 3.4.2). Figure 4 provides an overview of our model.
3.4.1 Conditioning Feature Extraction
Multimodal Feature Encoding. Given a sequence of visuomotor states sampled at 10 fps, where each state consists of head pose, gaze direction, and upper-body joints, we project them into a latent space using learned encoding functions. Egocentric RGB frames are sampled at 4 fps, providing complementary contextual information about the environment and task-relevant objects. A single visual embedding $z^{vis}$ is extracted from the frame sequence using a 3D ResNet backbone.
Multimodal Feature Fusion. While egocentric frames always contain cues directly related to head and gaze orientation, their relevance to full visuomotor coordination is uncertain due to partial occlusions and viewpoint limitations. To mitigate this, instead of using a single kinematic representation, we construct two separate kinematic representations: one capturing head and gaze features, denoted $z^{hg}$, and another incorporating full-body motion, denoted $z^{body}$.
We apply cross-attention separately to $z^{hg}$, which captures viewpoint and attentional dynamics, and to $z^{body}$, which additionally includes upper-body motion cues. This structured fusion allows the model to selectively incorporate spatial and attentional signals while maintaining robustness to missing or ambiguous visual information. Formally, we compute

$$a^{hg} = \mathrm{CrossAttn}(z^{hg}, z^{vis}), \qquad a^{body} = \mathrm{CrossAttn}(z^{body}, z^{vis}).$$
The final fused representation is obtained by summing the attended features, $z^{fuse} = a^{hg} + a^{body}$, which is then passed to a Transformer-based temporal encoder for sequential modeling. The output is flattened into a single conditioning feature $c$, which is used as input to the denoiser $\epsilon_\theta$.
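The sketch below illustrates the two-stream cross-attention fusion described above; the embedding dimension, number of heads, and the query/key-value assignment are assumptions rather than the paper's exact design.

```python
import torch
from torch import nn

class TwoStreamFusion(nn.Module):
    """Fuse two kinematic streams with a shared visual embedding via cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_hg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_body = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z_hg, z_body, z_vis):
        # z_hg, z_body: (B, T, dim) kinematic embeddings; z_vis: (B, 1, dim) visual embedding.
        a_hg, _ = self.attn_hg(z_hg, z_vis, z_vis)        # head/gaze stream attends to vision
        a_body, _ = self.attn_body(z_body, z_vis, z_vis)  # body stream attends to vision
        return a_hg + a_body                              # fused feature for the temporal encoder
```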
3.4.2 Diffusion-Based Visuomotor Prediction
We follow [9] to formulate visuomotor forecasting as a denoising diffusion process, where the model learns to iteratively refine a noisy sequence into a future trajectory.
Forward Diffusion Process. Following the standard denoising diffusion probabilistic model (DDPM) [20], we define the forward process as a Markovian sequence that progressively adds Gaussian noise to the ground-truth future visuomotor states $x_0 = S_{T_o+1:T_o+T_p}$:

$$q(x_k \mid x_{k-1}) = \mathcal{N}\!\left(x_k;\ \sqrt{1-\beta_k}\, x_{k-1},\ \beta_k \mathbf{I}\right),$$

where $\{\beta_k\}$ is a pre-defined noise schedule. This process converts the original data distribution into an isotropic Gaussian in the latent space.
Reverse Denoising Process. The model learns to iteratively denoise the corrupted future states using a neural network parameterized by $\theta$. The reverse process is defined as:

$$p_\theta(x_{k-1} \mid x_k, c) = \mathcal{N}\!\left(x_{k-1};\ \mu_\theta(x_k, k, c),\ \Sigma_k\right),$$

where $\mu_\theta$ is the predicted mean, $\Sigma_k$ is the variance, and the conditioning feature $c$ remains constant throughout the denoising process.
The model is trained using a standard denoising loss:

$$\mathcal{L} = \mathbb{E}_{x_0,\, k,\, \epsilon \sim \mathcal{N}(0, \mathbf{I})}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_k, k, c) \right\rVert^2\right],$$

where $\epsilon_\theta$ represents the noise prediction network.
At inference time, we sample future visuomotor trajectories by iteratively applying the learned reverse process, starting from a Gaussian prior.
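As a reference for the training objective, the sketch below implements a generic DDPM-style denoising loss for conditional trajectory prediction. The number of diffusion steps, the linear beta schedule, and the `denoiser` signature are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

K = 100                                               # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, K)                 # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, x0, cond):
    """x0: (B, T_pred, D) ground-truth future states; cond: (B, C) conditioning feature c."""
    b = x0.shape[0]
    k = torch.randint(0, K, (b,))                     # random diffusion step per sample
    a_bar = alphas_cumprod[k].view(b, 1, 1)
    noise = torch.randn_like(x0)
    x_k = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # sample from q(x_k | x_0)
    return F.mse_loss(denoiser(x_k, k, cond), noise)         # predict the injected noise
```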
3.4.3 Implementation Details
Training Details. The visual encoder is pre-trained on KINETICS400_V1, while both the transformer module and the DDPM diffusion model are trained from scratch. The model is implemented in PyTorch and trained for 400 epochs using the AdamW optimizer with a learning rate of . Training is conducted on a single H100 GPU, with a batch size of 384, requiring approximately 8 hours to complete.

4 Experiments
We conduct extensive experiments to evaluate our predictive visuomotor learning framework. We first introduce the datasets used in our study (Sec. 4.1), followed by the evaluation metrics (Sec. 4.2) used to assess prediction accuracy. We then present the baselines for comparison (Sec. 4.3) and analyze the impact of different model components through ablation studies (Sec. 4.4). Finally, we provide both quantitative (Sec. 4.5) and qualitative (Sec. 4.6) results to demonstrate the effectiveness of our approach.
4.1 Datasets
EgoExo4D [15] is a large-scale multimodal egocentric-exocentric video dataset collected by the Ego4D Consortium. It utilizes Aria glasses [10] alongside four GoPro cameras to capture aligned first-person (egocentric) and third-person (exocentric) views of skilled activities across more than 130 scene contexts, totaling approximately 88 hours of egocentric footage. The dataset provides head pose and gaze annotations computed by Meta’s Multimodal Perception Service (MPS), ensuring accurate SLAM-based pose estimation. Additionally, full-body 3D joint positions are annotated from the exocentric videos. While EgoExo4D encompasses a diverse range of activities, not all are equally suited for studying visuomotor coordination. We select a subset of activities where participants manipulate objects and navigate their environment in ways that naturally integrate visual attention with motor actions. The selected scenarios include Basketball, Cooking, Bike Fixing, and Health-related tasks, resulting in 23,372 training samples and 5,126 testing samples, with a total duration of approximately 15.8 hours. Detailed information about the data cleaning pipeline and per-class statistical distribution is available in Sec. B of our supplementary materials.
4.2 Evaluation Metrics
To comprehensively evaluate our predictive visuomotor learning task, we adopt a diverse set of metrics that assess structural consistency, positional accuracy, and orientation alignment. These metrics are computed over the visuomotor representation $S$, as defined in Sec. 3.1.
Structural Consistency. PA-MPJPE (mm) measures the structural consistency of predicted visuomotor coordination. Unlike absolute error metrics, PA-MPJPE aligns the predicted and ground-truth poses via a rigid transformation, ensuring that the error primarily reflects internal joint structure deviations rather than global displacement. This metric is computed over $S$.
Position Accuracy. We measure the Euclidean distance between the predicted and ground-truth positions of key body parts. Specifically, the Head, Gaze, and Hand Errors (mm) correspond to the head position $p$, the gaze endpoint $G$, and the wrist joints, which are a subset of $J$, respectively.
Orientation Accuracy. We evaluate the accuracy of head rotation with the Head Rotation Error (HRE, degrees), which measures the angular deviation between the predicted and ground-truth head orientations.
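For reproducibility, the snippets below show one common way to compute PA-MPJPE (rigid Procrustes alignment followed by the mean keypoint error) and the head rotation error as a geodesic angle; the exact keypoint sets and alignment details used in the paper may differ.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Mean per-keypoint error (mm) after a rigid (rotation + translation) alignment.
    pred, gt: (N, 3) corresponding keypoints."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, _, Vt = np.linalg.svd(P.T @ G)                             # Kabsch algorithm
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    aligned = P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()

def head_rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between predicted and ground-truth head rotations."""
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```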
4.3 Baselines
To evaluate the effectiveness of our method, we compare it against two naïve interpolation baselines and two learning-based methods, including a Diffusion Policy model. (Our work is concurrent with EgoCast [11] and EgoAgent [7], but a direct comparison is not feasible due to differences in task definitions and input modalities, as well as the lack of publicly available implementations at the time of writing.)
- Constant Pose: assumes no future motion, directly copying the last observed visuomotor state for all future time steps. It serves as a lower bound, representing a scenario where no predictive modeling is performed.
- Constant Velocity: estimates future motion using first-order linear extrapolation, assuming a constant velocity based on the last observed state transition. It provides a simple motion forecasting heuristic (both interpolation baselines are sketched in the code example after this list).
- Transformer Model: a standard Transformer-based sequence-to-sequence predictor that models motion evolution through self-attention. We concatenate pose sequences with visual embeddings extracted from ResNet18 and feed them into a 3-layer Transformer encoder, followed by a regression head to predict future visuomotor states.
- Diffusion Policy (CNN-Based) [9]: formulates motion prediction as a denoising diffusion process, where a neural network iteratively refines a noisy motion sequence into a plausible future trajectory. We apply this framework to our setting by extracting egocentric frame features using a pre-trained ResNet18, concatenating them with raw visuomotor observations, and flattening the resulting feature representation to condition a U-Net denoiser.
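The two interpolation baselines admit a direct implementation; the sketch below treats each visuomotor state as a flat vector and, as a simplifying assumption, ignores proper handling of rotations.

```python
import numpy as np

def constant_pose(past, t_pred):
    """Repeat the last observed state. past: (T_obs, D) -> (t_pred, D)."""
    return np.repeat(past[-1:], t_pred, axis=0)

def constant_velocity(past, t_pred):
    """First-order linear extrapolation from the last observed state transition."""
    v = past[-1] - past[-2]                       # last per-step displacement
    steps = np.arange(1, t_pred + 1)[:, None]
    return past[-1] + steps * v
```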
Table 1: Comparison with baseline methods. Each cell reports the error (PA-MPJPE and positions in mm, head rotation in degrees) followed, in parentheses, by the relative increase over our method.

| Methods | PA-MPJPE (mm) | Head Pos. (mm) | Gaze Pos. (mm) | Hand Pos. (mm) | Head Rot. (deg) |
|---|---|---|---|---|---|
| Constant Pose | 68.3 (+16.6%) | 184 (+73.6%) | 193 (+55.6%) | 274 (+45.7%) | 16.7 (+25.6%) |
| Constant Velocity | 109 (+84.7%) | 161 (+51.9%) | 201 (+62.1%) | 436 (+132%) | 18.5 (+39.1%) |
| Transformer Encoder + MLP | 65.3 (+10.7%) | 119 (+12.3%) | 135 (+8.1%) | 211 (+12.2%) | 13.8 (+4.5%) |
| Diffusion Policy-CNN [9] | 64.1 (+8.6%) | 112 (+5.7%) | 132 (+6.5%) | 208 (+10.6%) | 13.9 (+5.3%) |
| **Ours** | **59** | **106** | **124** | **188** | **13.2** |
4.4 Ablations
To assess the contribution of different components in our visuomotor prediction model, we conduct two types of ablations: one focusing on specific input and output signals, and another analyzing the impact of different modalities.
Input-Output Ablations. We first examine the role of key visuomotor signals by selectively removing them from the input or output space:
- w/o Head Rotation: Removes head rotation from the output and optionally from the input to test its necessity.
- w/o Head Rotation & Gaze: Removes both head rotation and gaze from the output and optionally from the input to evaluate their impact.
- w/o Head: Removes all head-related information (position and rotation) from the output and optionally from the input.
- w/o Gaze: Removes gaze from the output and optionally from the input to test its role.
Signal Modality Ablations. We further analyze the contribution of different sensory and temporal modalities:
- w/ Last Step Arm: Uses only the last observed arm configuration instead of the full motion history to evaluate the importance of temporal context.
- w/o Egocentric Frame: Removes egocentric visual input, leaving only kinematic information, to assess the role of first-person vision in visuomotor prediction.
Table 2: Ablation results. Each cell reports the error (positions and PA-MPJPE in mm, head rotation in degrees) followed, in parentheses, by the relative increase over the complete model. In the Input column, ✔ means the ablated signal is removed from the output only (still provided as input), while ✗ means it is removed from both input and output; ✗ in a metric column means the metric is unavailable because the corresponding signal is not predicted. PA-MPJPE is not reported (n/a) for the input-output ablations due to the incomplete output space.

| Ablations | Input | Head Pos. (mm) | Gaze Pos. (mm) | Hand Pos. (mm) | Head Rot. (deg) | PA-MPJPE (mm) |
|---|---|---|---|---|---|---|
| Complete Visuomotor | - | 106 | 124 | 188 | 13.2 | 59 |
| *Input-Output Ablations* | | | | | | |
| w/o Head Rotation | ✔ | 108 (+1.9%) | 126 (+1.6%) | 188 (+0.0%) | ✗ | n/a |
| w/o Head Rotation | ✗ | 111 (+4.7%) | 130 (+4.8%) | 195 (+3.7%) | ✗ | n/a |
| w/o Head Rotation & Gaze | ✔ | 109 (+2.8%) | ✗ | 190 (+1.1%) | ✗ | n/a |
| w/o Head Rotation & Gaze | ✗ | 112 (+5.7%) | ✗ | 196 (+4.3%) | ✗ | n/a |
| w/o Head | ✔ | ✗ | 127 (+2.4%) | 190 (+1.1%) | ✗ | n/a |
| w/o Head | ✗ | ✗ | 132 (+6.5%) | 194 (+3.2%) | ✗ | n/a |
| w/o Gaze | ✔ | 109 (+2.8%) | ✗ | 190 (+1.1%) | 13.4 (+1.5%) | n/a |
| w/o Gaze | ✗ | 111 (+4.7%) | ✗ | 194 (+3.2%) | 13.9 (+4.5%) | n/a |
| *Signal Modality Ablations* | | | | | | |
| w/ Last Step Arm | - | 113 (+6.6%) | 141 (+5.2%) | 199 (+5.9%) | 13.7 (+3.8%) | 61 (+3.4%) |
| w/o Egocentric Frame | - | 111 (+4.7%) | 130 (+4.8%) | 193 (+2.7%) | 14.1 (+6.0%) | 60 (+1.7%) |
4.5 Quantitative Results
Tables 1 and 2 summarize our results: baseline comparisons and ablation studies on different input signals, respectively; per-class prediction performance is reported in the supplementary material. The best results are highlighted in bold, and lower values across all metrics indicate better performance (Sec. 4.2).
Comparison with Baseline Methods. Table 1 compares our method with four baselines for visuomotor prediction. We first examine the naïve interpolation baselines, Constant Pose and Constant Velocity, which assume either no motion or simple extrapolation. As expected, these baselines perform poorly. Constant Pose yields a PA-MPJPE of 68.3 and a hand error of 274, showing the inadequacy of static predictions. Constant Velocity performs even worse, with PA-MPJPE reaching 109 and hand error rising to 436, indicating that simple extrapolation is insufficient for modeling complex visuomotor behavior.
Learning-based approaches significantly outperform the interpolation baselines, demonstrating the need for temporal modeling and multimodal integration. Among prior works, Diffusion Policy-CNN and the Transformer Encoder + MLP serve as strong baselines. Our model achieves a PA-MPJPE of 59, improving upon Diffusion Policy-CNN by 8.6% and the Transformer-based baseline by 10.7%. Similarly, our position errors are lower across the board; for example, we improve the head and gaze position errors of Diffusion Policy by 5.7% and 6.5%, respectively. These improvements indicate that our method generates more precise motion predictions while maintaining realistic visuomotor coordination.
Notably, our model achieves the largest improvement in predicting hand position, which is the most challenging sub-task. This suggests that our approach effectively captures head-eye-hand coordination, leveraging structured visuomotor representations for fine-grained motion prediction.
When evaluating head orientation, our model achieves an average error of 13.2 degrees across all steps, improving upon Diffusion Policy-CNN by 5.3% and the Transformer method by 4.5%. The fact that our model performs well on both head rotation (HRE) and translation suggests that it effectively learns a unified visuomotor representation, handling different motion modalities within the same framework.
Ablation Study. Table 2 evaluates the impact of removing different input components on visuomotor prediction. The Complete Visuomotor setting (row 0) includes all available inputs (head pose, gaze, and upper-body joints), while the ablations systematically remove each to examine its role.
Removing head rotation (row 2) increases head position error from 108 to 111, while additionally removing gaze input (row 4) further degrades performance to 112. Gaze position error rises from 126 to 130, and hand position error increases from 188 to 196, highlighting the importance of head and gaze signals for accurate motion prediction. When all head-related inputs are removed (row 6), gaze error increases from 127 to 132, and hand error from 190 to 194, indicating that head motion affects overall visuomotor coordination. The absence of gaze input (row 8) also increases head rotation error from 13.4 to 13.9, confirming its role in stabilizing head orientation.
The lower section of Table 2 examines high-level signal ablations. Preserving only the last-step arm pose (w/ Last Step Arm) increases head, gaze, and hand errors by 6.6%, 5.2%, and 5.9%, respectively, suggesting that a single arm pose lacks sufficient temporal context for predicting upper-body motion. Removing egocentric vision (w/o Egocentric Frame) leads to a 4.7% higher head position error and a 6% increase in head rotation error, reinforcing the role of visual context in stabilizing head and gaze coordination. Interestingly, the gaze error rises only modestly (124 to 130), suggesting that the model partially compensates by relying on kinematic cues for gaze estimation.
These results confirm that multimodal integration enhances predictive visuomotor coordination: temporal kinematic history improves hand predictions, while egocentric vision stabilizes head and gaze alignment.

4.6 Qualitative Results
Figure 5 illustrates the predicted visuomotor coordination across diverse real-world scenes. The model effectively captures head, gaze, and upper-body dynamics, maintaining smooth and temporally consistent motion. As shown in Rows (a)-(d), it successfully anticipates head and gaze shifts, even when adapting to scene constraints, demonstrating its ability to infer visuomotor intent from prior observations.
However, failure cases occur in scenarios with rapid, unexpected movements or occlusions, where subtle cues are insufficient for precise coordination. While our model still produces reasonable predictions, it may misalign with the ground truth. Row (e) presents a typical failure case: as the basketball bounces quickly—a nuance observable only in the last frame of the egocentric input—the person rapidly shifts their body rightward to catch it. In contrast, our prediction assumes a standard catching motion, failing to adapt to the sudden trajectory change. Despite these challenges, overall prediction quality remains strong, reinforcing the effectiveness of multimodal integration in learning visuomotor coordination. Future improvements could involve incorporating explicit contact modeling or leveraging environment-aware reasoning to enhance robustness in highly dynamic tasks. For a more comprehensive demonstration, we provide video examples in the supplementary materials.
5 Conclusion
We introduced a forecasting-based task for human visuomotor modeling, where the goal is to predict future head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. To achieve this, we proposed the Visuomotor Coordination Representation, which learns structured temporal dependencies across multimodal signals. We further extended a diffusion-based motion modeling framework integrating egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach was evaluated on the challenging large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Through extensive experiments, we showed that multimodal integration plays a crucial role in improving visuomotor modeling. Our results provide insights into structured human motion representation, contributing to applications in robotics, human-computer interaction, and assistive technologies.
References
- Aldaco et al. [2024] Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, et al. Aloha 2: An enhanced low-cost hardware for bimanual teleoperation. arXiv preprint arXiv:2405.02292, 2024.
- Ashutosh et al. [2024] Kumar Ashutosh, Georgios Pavlakos, and Kristen Grauman. Fiction: 4d future interaction prediction from video. arXiv preprint arXiv:2412.00932, 2024.
- Avraham et al. [2019] Guy Avraham, Erez Sulimani, Ferdinando A Mussa-Ivaldi, and Ilana Nisky. Effects of visuomotor delays on the control of movement and on perceptual localization in the presence and absence of visual targets. Journal of neurophysiology, 122(6):2259–2271, 2019.
- Bahl et al. [2022] Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022.
- Bizzi et al. [1984] Emilio Bizzi, Neri Accornero, William Chapple, and Neville Hogan. Posture control and trajectory formation during arm movement. Journal of Neuroscience, 4(11):2738–2744, 1984.
- Black et al. [2024] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- Chen et al. [2025] Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, and Sida Peng. Acquisition through my eyes and steps: A joint predictive agent model in egocentric worlds. arXiv preprint arXiv:2502.05857, 2025.
- Chen et al. [2023] Ling-Hao Chen, Jiawei Zhang, Yewen Li, Yiren Pang, Xiaobo Xia, and Tongliang Liu. Humanmac: Masked motion completion for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9544–9555, 2023.
- Chi et al. [2023] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
- Engel et al. [2023] Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023.
- Escobar et al. [2025] Maria Escobar, Juanita Puentes, Cristhian Forigua, Jordi Pont-Tuset, Kevis-Kokitsi Maninis, and Pablo Arbeláez. Egocast: Forecasting egocentric human pose in the wild. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025.
- Finn et al. [2017] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on robot learning, pages 357–368. PMLR, 2017.
- Fragkiadaki et al. [2015] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE international conference on computer vision, pages 4346–4354, 2015.
- Grauman et al. [2022] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
- Grauman et al. [2024] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.
- Graziano [2006] Michael Graziano. The organization of behavioral repertoire in motor cortex. Annu. Rev. Neurosci., 29(1):105–134, 2006.
- Guo et al. [2023] Wen Guo, Yuming Du, Xi Shen, Vincent Lepetit, Xavier Alameda-Pineda, and Francesc Moreno-Noguer. Back to mlp: A simple baseline for human motion prediction. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 4809–4819, 2023.
- Hayhoe et al. [2003] Mary M Hayhoe, Anurag Shrivastava, Ryan Mruczek, and Jeff B Pelz. Visual memory and motor planning in a natural task. Journal of vision, 3(1):6–6, 2003.
- Hernandez et al. [2019] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7134–7143, 2019.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hu et al. [2024] Zhiming Hu, Syn Schmitt, Daniel Häufle, and Andreas Bulling. Gazemotion: Gaze-guided human motion forecasting. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13017–13022. IEEE, 2024.
- Jia et al. [2022] Wenqi Jia, Miao Liu, and James M Rehg. Generative adversarial network for future hand segmentation from egocentric video. In European Conference on Computer Vision, pages 639–656. Springer, 2022.
- Jiang et al. [2024] Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. arXiv preprint arXiv:2410.24185, 2024.
- Johansson and Flanagan [2009] Roland S Johansson and J Randall Flanagan. Coding and use of tactile signals from the fingertips in object manipulation tasks. Nature Reviews Neuroscience, 10(5):345–359, 2009.
- Kareer et al. [2024] Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. arXiv preprint arXiv:2410.24221, 2024.
- Kim et al. [2024] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- Kratzer et al. [2020] Philipp Kratzer, Simon Bihlmaier, Niteesh Balachandra Midlagajni, Rohit Prakash, Marc Toussaint, and Jim Mainprice. Mogaze: A dataset of full-body motions that includes workspace geometry and eye-gaze. IEEE Robotics and Automation Letters, 6(2):367–373, 2020.
- Lai et al. [2022] Bolin Lai, Miao Liu, Fiona Ryan, and James M Rehg. In the eye of transformer: Global-local correlation for egocentric gaze estimation. arXiv preprint arXiv:2208.04464, 2022.
- Lai et al. [2023] Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, and James M Rehg. Listen to look into the future: Audio-visual egocentric gaze anticipation. arXiv preprint arXiv:2305.03907, 2023.
- Land and Hayhoe [2001] Michael F Land and Mary Hayhoe. In what ways do eye movements contribute to everyday activities? Vision research, 41(25-26):3559–3565, 2001.
- Lappi and Mole [2018] Otto Lappi and Callum Mole. Visuomotor control, eye movements, and steering: A unified approach for incorporating feedback, feedforward, and internal models. Psychological bulletin, 144(10):981, 2018.
- Li et al. [2023] Jiaman Li, Karen Liu, and Jiajun Wu. Ego-body pose estimation via ego-head pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023.
- Li et al. [2024] Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation. In 8th Annual Conference on Robot Learning, 2024.
- Lin et al. [2025] Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, and Yuke Zhu. Sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids. arXiv preprint arXiv:2502.20396, 2025.
- Liu et al. [2022] Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3282–3292, 2022.
- Lynch et al. [2019] Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning, 2019.
- Ma et al. [2024] Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, et al. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. arXiv preprint arXiv:2406.09905, 2024.
- Mandlekar et al. [2021] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), 2021.
- Mao et al. [2019] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9489–9497, 2019.
- Martinez et al. [2017] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017.
- Nijhawan and Wu [2009] Romi Nijhawan and Si Wu. Compensating time delays with neural predictions: are predictions sensory or motor? Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1891):1063–1078, 2009.
- Pavllo et al. [2020] Dario Pavllo, Christoph Feichtenhofer, Michael Auli, and David Grangier. Modeling human motion with quaternion-based neural networks. International Journal of Computer Vision, 128:855–872, 2020.
- Ren et al. [2025] Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, and Jeannette Bohg. Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning. arXiv preprint arXiv:2501.06994, 2025.
- Shadmehr and Krakauer [2008] Reza Shadmehr and John W Krakauer. A computational neuroanatomy for motor control. Experimental brain research, 185:359–381, 2008.
- Shadmehr et al. [2010] Reza Shadmehr, Maurice A Smith, and John W Krakauer. Error correction, sensory prediction, and adaptation in motor control. Annual review of neuroscience, 33(1):89–108, 2010.
- Tan et al. [2023] Shuhan Tan, Tushar Nagarajan, and Kristen Grauman. Egodistill: Egocentric head motion distillation for efficient video understanding. Advances in Neural Information Processing Systems, 36:33485–33498, 2023.
- Wang et al. [2021] Chen Wang, Rui Wang, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, and Danfei Xu. Generalization through hand-eye coordination: An action space for learning spatially-invariant visuomotor control. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8913–8920. IEEE, 2021.
- Wolpert and Flanagan [2001] Daniel M Wolpert and J Randall Flanagan. Motor prediction. Current biology, 11(18):R729–R732, 2001.
- Yan et al. [2023] Haodong Yan, Zhiming Hu, Syn Schmitt, and Andreas Bulling. Gazemodiff: Gaze-guided diffusion model for stochastic human motion prediction. arXiv preprint arXiv:2312.12090, 2023.
- Zago et al. [2009] Myrka Zago, Joseph McIntyre, Patrice Senot, and Francesco Lacquaniti. Visuo-motor coordination and internal models for object interception. Experimental Brain Research, 192:571–604, 2009.
- Zeng et al. [2021] Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021.
- Zhang et al. [2017] Mengmi Zhang, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, and Jiashi Feng. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4372–4381, 2017.
- Zheng et al. [2022] Yang Zheng, Yanchao Yang, Kaichun Mo, Jiaman Li, Tao Yu, Yebin Liu, C Karen Liu, and Leonidas J Guibas. Gimo: Gaze-informed human motion prediction in context. In European Conference on Computer Vision, pages 676–694. Springer, 2022.
- Zhu et al. [2023] Yifeng Zhu, Zhenyu Jiang, Peter Stone, and Yuke Zhu. Learning generalizable manipulation policies with object-centric 3d representations. arXiv preprint arXiv:2310.14386, 2023.
- Zhu et al. [2024] Yifeng Zhu, Arisrei Lim, Peter Stone, and Yuke Zhu. Vision-based manipulation from single human video with open-world object graphs. arXiv preprint arXiv:2405.20321, 2024.
Appendix A Additional Video Demo
Due to space limitations, the main paper only presents static visualizations. In this supplementary material, we provide a richer set of qualitative demos showcasing our model’s predictions across diverse scenarios. These include extended visualizations of predicted head pose, gaze, and upper-body motion, highlighting both successful cases and failure modes. We also analyze common prediction errors, such as cases where subtle cues lead to incorrect predictions and the inherent challenge of unforeseeable human motion, to better illustrate the model’s strengths and limitations.
We encourage readers to view the full set of qualitative results in the provided demo videos for a more comprehensive understanding of our model’s performance.
Appendix B EgoExo4D Data Cleaning Pipeline
EgoExo4D provides detailed pose annotations by running off-the-shelf human pose estimation models on multiple exocentric camera views. However, occlusions frequently lead to missing body parts, resulting in occasional inaccuracies in the automatically generated annotations. Even with manually annotated corrections, these issues remain common due to unavoidable viewpoint limitations.
While Fiction [2] improves annotation quality by re-annotating filtered sequences using narration-based semantic cues, our goal is to model general visuomotor coordination without restricting the dataset based on activity type. Instead of manually filtering data based on semantics, we apply a 5-second sliding window to correct annotation errors using temporally adjacent frames. If a missing or incorrect joint annotation cannot be recovered within a reasonable range, we discard that frame.
After this cleaning process, our training and testing samples are drawn from the valid index list using a 20-step sliding window with a stride of 10 steps. This ensures that our model learns from reliable annotations while maintaining a broad range of natural visuomotor behaviors, aligning with our goal of capturing general coordination patterns rather than task-specific motions.
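A minimal sketch of this sampling procedure is shown below; the requirement that every frame in a window be valid is our assumption about how discarded frames are handled.

```python
def make_windows(valid_indices, window=20, stride=10):
    """Draw samples from the cleaned valid-frame index list with a sliding window."""
    valid = set(valid_indices)
    windows = []
    for start in range(0, max(valid_indices) - window + 2, stride):
        idx = list(range(start, start + window))
        if all(i in valid for i in idx):          # keep only fully valid windows
            windows.append(idx)
    return windows
```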
| Task / Δ | Basketball | Cooking | Bike | Health |
|---|---|---|---|---|
| PA-MPJPE | 78 / 116 | 47 / 59 | 52 / 61 | 38 / 48 |
| Head Pos. | 16 / 35 | 12 / 18 | 11 / 19 | 11 / 17 |
| Gaze Pos. | 195 / 546 | 89 / 164 | 85 / 137 | 64 / 94 |
| Hand Pos. | 304 / 733 | 128 / 208 | 124 / 166 | 87 / 117 |
| Head Rot. | 16 / 35 | 12 / 18 | 11 / 19 | 11 / 17 |
| Count | 2,037 | 887 | 1,136 | 1,066 |
| Time Step | t+1 | t+3 | t+5 | t+7 | t+10 | Mean |
|---|---|---|---|---|---|---|
| PA-MPJPE | 29 | 48 | 60 | 68 | 78 | 59 |
| Head Pos. | 16 | 54 | 94 | 136 | 200 | 106 |
| Gaze Pos. | 26 | 69 | 112 | 156 | 226 | 124 |
| Hand Pos. | 61 | 130 | 181 | 228 | 294 | 188 |
| Head Rot. | 2.5 | 7.6 | 12.3 | 16.7 | 23.2 | 13.2 |
Appendix C Per-Class Performance Analysis
Table 3 presents the per-class prediction performance of our model across different skilled activities. The results reveal a strong correlation between motion variability and prediction difficulty, with activities exhibiting larger motion change amplitudes (Δ) leading to higher errors. Structured tasks like Cooking and Health yield lower errors, while dynamic or fine-grained activities such as Basketball and Bike Fixing introduce more uncertainty due to abrupt gaze shifts and complex hand-eye coordination. These findings highlight the challenge of modeling high-motion scenarios, suggesting that improving robustness in such conditions is crucial for advancing visuomotor prediction models.
Appendix D Per-Step Performance Analysis
Table 4 presents the per-step prediction performance of our model across different future time steps. While the diffusion model generates the entire trajectory at once, errors increase over longer horizons due to growing uncertainty in future motion. At t+1, predictions are highly accurate, with a head position error of 16 mm and a head rotation error of 2.5 degrees. However, as the time step extends, errors grow significantly, reaching 200 mm for head position and 226 mm for gaze position at t+10, reflecting the increasing difficulty of modeling long-range dependencies where future states become less constrained by recent observations.
Hand motion exhibits the largest variation, with error rising from 61 mm at t+1 to 294 mm at t+10, suggesting that fine-grained hand movements are harder to predict due to their higher variability and dependence on external factors. In contrast, head rotation remains more stable, increasing gradually to 23.2 degrees at t+10, indicating that head orientation follows smoother, more predictable patterns. These results suggest that incorporating trajectory-level constraints or enhancing long-range temporal dependencies could improve long-horizon stability, particularly for hand motion.