Can Video Diffusion Models Reconstruct 4D Geometry?
Abstract
Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representations or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in the combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
1 Introduction

The ability to capture and reconstruct detailed 3D structures from visual data has long been a cornerstone of computer vision, powering critical tasks in robotics, autonomous driving, augmented reality (AR), and virtual reality (VR). Traditional multiview geometry-based frameworks [133, 47] for Simultaneous Localization and Mapping (SLAM), Structure-from-Motion (SfM), and 3D reconstruction have matured into robust pipelines like COLMAP [116] and ORB-SLAM [93, 19], offering reliable camera pose estimation together with sparse or semi-dense maps of static scenes. Despite their enduring success, these methods often struggle with dynamic objects or scenes, prompting the community to filter out dynamic and non-rigid components through motion segmentation [45, 7] or other heuristics [24, 176, 177]. As the demand for video-based understanding grows, there is increasing interest in not only achieving dense geometric reconstructions but also advancing toward 4D (3D + temporal 1D) modeling: capturing both spatial structure and temporal dynamics within a scene.
Such 4D reconstruction is critical for many real-world applications: robotics platforms [92] increasingly train in virtual environments, AR/VR [34] users seek to teleport dynamic real-world scenes into digital spaces, and content creators aspire to manipulate 4D assets for visual effects or physically consistent interaction [1]. While existing research has progressed considerably, many approaches either (a) reconstruct various 4D representations like 4D NeRF [91] or 4D Gaussian Splatting [62] from video synthesis [104, 149, 76, 165, 98, 118, 73, 175], or (b) learn from geometric supervision signals like depth, optical flow, point tracks, or camera pose [176, 70, 24, 67].
Recently, DUSt3R [143] opened a new chapter by introducing pointmap regression from pairwise images, demonstrating a promising way forward for dense 3D reconstruction: each pixel across input views is associated with a 3D coordinate in a unified world frame, which has inspired many variants and follow-ups [82, 170, 17, 157, 137, 142, 68]. WVD [174] leverages video diffusion to generate videos and pointmaps jointly but is restricted to static scenes. Among these follow-ups, MonST3R [170] advocates a temporal extension, predicting 4D pointmaps for a pair of video frames from dynamic scenes. However, training such a method still demands large amounts of high-quality 4D data, which are hard to obtain [57], and, in practice, it depends on strong auxiliary modules [64, 145] plus lengthy post-optimization for global refinement.
In parallel, video diffusion models (VDMs) [50, 113, 13, 22, 44], trained at scale on vast unlabeled videos, have shown remarkable generative abilities, capturing not only the photometric properties of scenes but also coherent physical dynamics [65, 90, 12, 112, 99]. Recently, Marigold [61] has shown that diffusion models can be repurposed for single-image depth estimation by sharing a common latent space between the image and depth domains, while [53, 117] further extend the idea to VDMs, but only for video depth estimation. These findings raise a natural yet interesting question: can general-purpose video diffusion backbones be leveraged to reconstruct full 4D geometry directly, obviating the need for massive annotated 4D datasets or cumbersome optimization pipelines?
To this end, we answer this question affirmatively by proposing a novel approach, Sora3R, which marries the learned dynamic “world knowledge” of video diffusion models [12] with the 4D pointmap representation. Specifically, we introduce a two-stage framework that bridges the gap between pointmap regression and generative video models. First, we finetune a specialized pointmap VAE from a pretrained video VAE to preserve latent-space compatibility, addressing the inherent distributional mismatch between video frames and 4D geometry. Then, we finetune a transformer-based diffusion backbone [179] within the combined latent space of video and pointmaps. The proposed Sora3R is an efficient, feedforward framework that predicts coherent 4D pointmaps for all frames of a monocular video, requiring no external segmentation, optical flow, or iterative global alignment. Our contributions are threefold:
• We adapt a pointmap VAE from a pretrained video VAE to encode latent 4D geometry while maintaining consistency for DiT learning.
• We present Sora3R, a novel video-diffusion pipeline that directly infers 4D pointmaps from a monocular video, enabling dynamic scene geometry reconstruction.
• We demonstrate that Sora3R efficiently recovers both camera poses and dense scene structures without complex optimization, paving the way for many possible downstream 3D/4D tasks in dynamic real-world environments.
2 Related works
Video Diffusion Models. Diffusion models (DMs) [50, 95, 54], designed to recover data from noise incrementally, have been validated as scalable solutions for large generative systems, including multimodal image generation [113, 31, 109, 114, 29, 155, 154, 25, 86, 38] and multimodal video generation [22, 13, 44, 90, 65, 27, 167, 8, 36, 139, 41, 150, 51, 77, 162, 23, 66]. In image generation tasks, the model produces an image aligned with the input text or class label. For video generation, temporal layers are added to the image-based DM, allowing the model to generate a video based on a text prompt, an image, or both. Recent diffusion models primarily adopt the latent diffusion model architecture [113, 8], where a VAE-based compressor [63, 35] embeds pixel-space values into a highly compressed latent code, and a diffusion model learns within this latent space. Empirical studies [22, 13, 44, 65, 127] indicate that, in this way, DMs are able to learn from web-scale video data and consequently demonstrate intriguing properties, such as capturing fundamental physical principles from videos. This has sparked interest in exploring the practical efficacy of video models in simulators [13, 46, 135, 14]. While concurrent works [2, 20] study the interplay between 3D/4D and video models through representation [108, 9] and MAE [132], our work adopts the video diffusion model backbone, advancing beyond generative video methods to 4D reconstruction.
3D and 4D Diffusion Models. 3D diffusion models [153, 131, 10, 42, 136, 173, 26, 58, 85] have mostly been applied to generation tasks, with common approaches including SDS optimization [102, 75, 152, 107] and feedforward inference [79, 78, 120, 148, 125, 52, 172]. Many of them [80, 119, 83, 141, 89, 105, 76, 115, 165, 56] are able to generate unseen parts given single-view or sparse-view observations. Most recent dynamic 3D generation works [151, 156, 166, 55] have introduced the concept of 4D diffusion into the community. Some of them inject camera pose conditions [72, 5, 99, 112, 49, 103, 4], pointmaps [174], object motion [106, 3, 40], or point tracking [15, 43] for controllable 4D consistency. The existing literature adopts a wide range of 4D representations, such as 4D NeRF or Gaussian splatting [91, 62, 181, 111, 121], multiview videos [69, 6, 74, 169], or deformable geometry [71]. Different from all of them, we adopt the 4D pointmap from a reconstruction perspective as our representation. Also, unlike concurrent DM works on reconstruction tasks that solely predict camera pose [171, 138, 87], depth [61, 53, 117, 60], or optical flow [110], we predict 4D pointmaps to reconstruct the full 4D geometry.
3D and 4D Reconstruction. Classic vision-based methods for SLAM, SfM, and 3D reconstruction based on multiview geometry [133, 47] have been studied for decades. Popular frameworks [18, 97] like COLMAP [116] and ORB-SLAM [93, 19] have been the cornerstone of many downstream tasks and applications. However, when extended to 4D reconstruction, they often filter out dynamic components through motion segmentation [177, 45, 7] to ensure geometric consistency, reconstructing static scenes only. The rise of learning-based methods has brought many new solutions for 3D and 4D reconstruction, especially dense mapping and reconstruction. DROID-SLAM [128, 130, 70] predicts depth along with deep bundle adjustment. Similarly, FlowMap [122, 123] performs first-order optimization through vanilla gradient descent under optical flow guidance [129]. Other methods [147, 134, 88, 182, 168] adopt view synthesis as the optimization objective, while [21, 159] treat it as a learning objective. Feedforward methods [175, 141, 176, 163, 118, 73, 39] trained with different forms of supervision predict reconstructions end-to-end. Tracking-based methods [177, 96, 140, 98, 160] like LEAP-VO [24] learn reconstruction from static or dynamic point tracks. ACE-Zero [11] reformulates the reconstruction problem as scene coordinate regression. Recently, DUSt3R [143] has inspired many follow-up “3R” works in different directions, such as video depth estimation [84], stateful reconstruction [137, 142], matching [68], SLAM/SfM [82, 33, 94], visual localization [32], multiview reconstruction [157, 126, 17], and 4D reconstruction [170, 142, 157, 57]. We focus on dynamic reconstruction with the pointmap representation as in MonST3R [170], but we adopt an orthogonal video diffusion pipeline without lengthy global alignment and remove the dependence on strong segmentation [64] and optical flow [145] modules.
3 Method
Overview. We describe our model design and training in detail in Sec. 3.1 and Sec. 3.2, as shown in Fig. 1. We elaborate on model inference, i.e., the procedure for generating 4D pointmaps, in Sec. 3.3 and Fig. 2. We illustrate our post-optimization, i.e., the process of inferring intrinsics, extrinsics, and depth from 4D pointmaps, in Sec. 3.4.
3.1 Temporal pointmap latent and VAE
Formally, given raw video frame input $V = \{I_t\}_{t=1}^{N}$ with $I_t \in \mathbb{R}^{H \times W \times 3}$, a typical pretrained temporal VAE with encoder $\mathcal{E}$ and decoder $\mathcal{D}$ learns to model the video latent distribution $z_V = \mathcal{E}(V)$.
A pointmap [143] is a group of pixel-wise 3D coordinates for a given frame, establishing the one-to-one pixel-point association in the world frame (usually the first frame). Following the definition [143] and notation [174] of pointmaps, while images have three RGB color channels, we denote the pointmap data as XYZ since it has three spatial channels. We use interchangeable terms between temporal pointmaps and 4D (3D + temporal 1D) pointmaps.
In contrast to existing approaches [61, 53, 117, 174] that freeze pretrained VAEs without additional tuning, we argue that fine-tuning is essential when transferring from temporal RGB images to temporal pointmaps. Real-world depth values can be extremely wide-ranging or effectively unbounded, causing the normalized pointmaps to become imbalanced, poorly scaled, and difficult for an unmodified video VAE to encode and decode. To alleviate this input gap, we propose to learn the temporal pointmap latent that keeps close to video latent but has the ability to represent 4D geometry, as our implicit 4D representation for dynamic scene understanding.
In other words, our goal is to fine-tune the RGB VAE $(\mathcal{E}, \mathcal{D})$ into an XYZ VAE model $(\mathcal{E}_X, \mathcal{D}_X)$. During finetuning, we know the groundtruth camera-to-world poses $\{P_t\}_{t=1}^{N}$, where $P_t \in \mathrm{SE}(3)$. We always perform preprocessing to guarantee that the first frame is the world coordinate frame, i.e., $P_1 = I$. With depth maps $\{D_t\}_{t=1}^{N}$ and known intrinsics $K$, we can easily obtain the global pointmap, i.e., the temporal XYZ frames $X = \{X_t\}_{t=1}^{N}$, where:

$X_t(u, v) = P_t \, K^{-1} \, D_t(u, v) \, [u, v, 1]^{\top}$ (1)

We omit the homogeneous transform in Eq. 1 for brevity. Additionally, to normalize the wide-ranging scene coordinates, we apply a norm scale factor $s$ to the camera poses, depths, and pointmaps. We set the norm factor as the average distance of all valid points to the origin [143]:

$s = \frac{1}{\sum_{t} |\mathcal{V}_t|} \sum_{t=1}^{N} \sum_{(u,v) \in \mathcal{V}_t} \left\| X_t(u, v) \right\|_2$ (2)

where $\mathcal{V}_t$ denotes the set of pixels with valid depth in frame $t$.
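To make Eqs. 1 and 2 concrete, the following minimal NumPy sketch shows how groundtruth depths, a shared intrinsic matrix, and camera-to-world poses could be assembled into normalized world-frame pointmaps; the tensor shapes and function names here are our own illustrative choices, not the released implementation.

```python
import numpy as np

def depth_to_global_pointmaps(depths, K, poses_c2w):
    """Unproject per-frame depths into world-frame pointmaps (Eq. 1).

    depths:    (N, H, W) groundtruth depth maps
    K:         (3, 3) shared camera intrinsics
    poses_c2w: (N, 4, 4) camera-to-world poses, with poses_c2w[0] = identity
    returns:   (N, H, W, 3) pointmaps in the first-frame (world) coordinates
    """
    N, H, W = depths.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                      # K^{-1} [u, v, 1]^T

    pointmaps = []
    for t in range(N):
        cam_pts = rays * depths[t][..., None]                            # camera-frame points
        cam_h = np.concatenate([cam_pts, np.ones((H, W, 1))], axis=-1)
        world = cam_h @ poses_c2w[t].T                                   # transform to world frame
        pointmaps.append(world[..., :3])
    return np.stack(pointmaps)

def normalize_pointmaps(pointmaps, valid_mask):
    """Scale by the average distance of valid points to the origin (Eq. 2)."""
    s = np.linalg.norm(pointmaps[valid_mask], axis=-1).mean()
    return pointmaps / s, s
```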
However, we find that the vanilla $\mathcal{L}_1$ reconstruction loss widely adopted in video VAEs lacks sensitivity to pointmap reconstruction precision, while an $\mathcal{L}_2$ reconstruction loss over-regularizes the model on outlier points. Therefore, we reformulate the Huber loss [48] as our XYZ VAE reconstruction loss:
$\mathcal{L}_{\mathrm{rec}}(x, \hat{x}) = \begin{cases} \frac{1}{2}\,(x - \hat{x})^2, & |x - \hat{x}| \le \delta \\ \delta\,\big(|x - \hat{x}| - \frac{1}{2}\delta\big), & \text{otherwise} \end{cases}$ (3)

where $x$ and $\hat{x}$ denote groundtruth and reconstructed pointmap values and $\delta$ is the Huber threshold. Note that $\mathcal{L}_{\mathrm{rec}}$ is element-wise and only calculated on points with valid depths. Points in background regions like the sky with infinite depth are masked. Finally, along with the standard Kullback-Leibler divergence loss $\mathcal{L}_{\mathrm{KL}}$, we formulate the total training loss for our XYZ VAE as:

$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{KL}}\, \mathcal{L}_{\mathrm{KL}}$ (4)
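A compact PyTorch sketch of the masked Huber reconstruction term (Eq. 3) combined with the KL term (Eq. 4) is given below; the Huber threshold `delta` and the KL weight `kl_weight` are hypothetical hyperparameters, since the paper does not specify them.

```python
import torch
import torch.nn.functional as F

def xyz_vae_loss(pred_xyz, gt_xyz, valid_mask, mu, logvar,
                 delta=1.0, kl_weight=1e-6):
    """Masked Huber reconstruction (Eq. 3) plus KL regularization (Eq. 4).

    pred_xyz, gt_xyz: (B, T, H, W, 3) reconstructed / groundtruth pointmaps
    valid_mask:       (B, T, H, W) bool, False for sky or invalid depth
    mu, logvar:       posterior parameters of the XYZ VAE latent
    delta, kl_weight: hypothetical hyperparameters (not specified in the paper)
    """
    huber = F.huber_loss(pred_xyz, gt_xyz, delta=delta, reduction="none")
    mask = valid_mask.unsqueeze(-1).expand_as(huber).float()
    rec = (huber * mask).sum() / mask.sum().clamp(min=1.0)   # average over valid elements only

    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kl
```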

3.2 4D Geometry DiT
Once our XYZ VAE is trained, we repurpose pretrained video diffusion models to denoise our proposed temporal pointmap latent, analogous to the canonical RGB video latent. We adopt a transformer-based architecture instead of a UNet [113] for its scalability and transferability via the spatial-temporal attention mechanism [100]. Motivated by VDMs’ emergent ability to understand world physics and dynamics [12], we start our fine-tuning from a model pretrained on massive web-scale videos to benefit from its strong spatiotemporal priors, mitigating the need for large-scale, high-quality 4D annotated datasets, which are hard to obtain [57, 6].
Inspired by [61], as shown in Fig. 1, we first obtain the RGB video latent $z_V = \mathcal{E}(V)$ and the XYZ pointmap latent $z_X = \mathcal{E}_X(X)$ from the RGB and XYZ VAE encoders, respectively. We train the latent diffusion transformer with rectified flow [37, 81]. For $z_X$ and a normalized time step $\tau \in [0, 1]$, the noised input $z_X^{\tau}$ is sampled from a straight path between the target distribution and a standard normal distribution $\epsilon \sim \mathcal{N}(0, I)$:

$z_X^{\tau} = (1 - \tau)\, z_X + \tau\, \epsilon$ (5)

Then, we adopt the 4D DiT model to predict the velocity $v_{\theta}$, i.e., update the parameters $\theta$ by minimizing the learning objective:

$\mathcal{L}_{\mathrm{DiT}} = \mathbb{E}_{z_X, \epsilon, \tau} \left[ \left\| v_{\theta}\!\left(z_X^{\tau}, z_V, \tau\right) - (\epsilon - z_X) \right\|_2^2 \right]$ (6)

Here, $z_V$ is concatenated with $z_X^{\tau}$ along the channel dimension, serving as an additional condition for the denoising process.
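The rectified-flow objective in Eqs. 5 and 6, together with the channel-wise conditioning on the video latent, can be sketched as a single training step as follows; `dit`, `rgb_vae`, and `xyz_vae` are placeholder module handles and the exact latent shapes are our own assumptions.

```python
import torch

def dit_training_step(dit, rgb_vae, xyz_vae, video, gt_pointmaps):
    """One rectified-flow training step (Eqs. 5-6) with video-latent conditioning.

    `dit`, `rgb_vae`, and `xyz_vae` are placeholder module handles; the channel-wise
    concatenation (4 + 4 = 8 latent channels) follows the description in Sec. 4.2.
    """
    with torch.no_grad():
        z_v = rgb_vae.encode(video)          # (B, 4, T', H', W') video latent
        z_x = xyz_vae.encode(gt_pointmaps)   # (B, 4, T', H', W') pointmap latent

    tau = torch.rand(z_x.shape[0], device=z_x.device)        # normalized time step in [0, 1)
    tau_b = tau.view(-1, 1, 1, 1, 1)
    eps = torch.randn_like(z_x)

    z_x_tau = (1.0 - tau_b) * z_x + tau_b * eps              # Eq. 5: straight interpolation path
    target_v = eps - z_x                                     # velocity along the path

    pred_v = dit(torch.cat([z_x_tau, z_v], dim=1), tau)      # condition via channel concat
    return torch.mean((pred_v - target_v) ** 2)              # Eq. 6
```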
We hypothesize that, since the XYZ VAE is fine-tuned from the RGB VAE, the latents $z_X$ and $z_V$ should share some internal features that express the scene patterns, even though XYZ and RGB maps are very different ways to represent the 4D scene. As the DiT was pretrained with the RGB VAE, such conditioning can help the DiT gradually adapt to denoising in the 4D geometry latent domain through these bridging hidden representations as training on pointmap data proceeds.
3.3 4D Pointmap Inference
During sampling, we only need the RGB VAE encoder $\mathcal{E}$ and the XYZ VAE decoder $\mathcal{D}_X$. As shown in Fig. 2, we sample random noise and concatenate it with the video latent $z_V$. The denoised temporal pointmap latent $\hat{z}_X$ from the DiT is finally decoded into 4D pointmaps $\hat{X} = \mathcal{D}_X(\hat{z}_X)$.

The concatenated $z_V$ here not only serves as the denoising condition but also encodes rich spatiotemporal features from the video that help our DiT infer the 4D geometry. Since our method processes all video frames at once through diffusion, it captures global spatiotemporal dependencies and infers temporally consistent and spatially coherent 4D pointmaps.
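A minimal sketch of this sampling procedure is shown below, assuming a plain Euler integrator for the rectified flow; the number of steps and the integrator choice are illustrative and not taken from the paper.

```python
import torch

@torch.no_grad()
def sample_pointmaps(dit, rgb_vae, xyz_vae, video, num_steps=30):
    """Sketch of 4D pointmap inference: integrate the rectified flow from noise
    (tau = 1) back to data (tau = 0) with plain Euler steps, then decode."""
    z_v = rgb_vae.encode(video)                  # video latent as the condition
    z_x = torch.randn_like(z_v)                  # start from Gaussian noise

    taus = torch.linspace(1.0, 0.0, num_steps + 1, device=z_v.device)
    for i in range(num_steps):
        tau, d_tau = taus[i], taus[i + 1] - taus[i]           # d_tau < 0
        v = dit(torch.cat([z_x, z_v], dim=1), tau.expand(z_x.shape[0]))
        z_x = z_x + d_tau * v                                 # Euler update toward the data

    return xyz_vae.decode(z_x)                   # decode latent to 4D pointmaps
```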
Name | Type | Camera Motion | Scene Motion | Sequences | Frames |
---|---|---|---|---|---|
DynamicReplica [59] | Synthetic, Indoor | Small | Moderate | 483 | 144.9K |
PointOdyssey [178] | Synthetic, Indoor & Outdoor | Various | Large | 123 | 216.7K |
Objaverse [30] | Synthetic, Objects | Moderate | Various | 520 | 187.2K |
TartanAir [144] | Synthetic, Indoor & Outdoor | Large | Static | 369 | 306.6K |
RealEstate10K [180] | Realworld, Indoor & Outdoor | Moderate | Static | 3753 | 671.8K |
 | Sintel [16] | | | TUM-dynamics [124] | | | ScanNet (static) [28] | |
---|---|---|---|---|---|---|---|---|---
Method | ATE | RPE trans | RPE rot | ATE | RPE trans | RPE rot | ATE | RPE trans | RPE rot
DUSt3R [143] | 0.320 | 0.159 | 14.901 | 0.026 | 0.024 | 1.102 | 0.019 | 0.023 | 0.458 |
MonST3R [170] | 0.155 | 0.064 | 1.739 | 0.009 | 0.010 | 0.725 | 0.011 | 0.011 | 0.315 |
Ours (Sora3R) | 0.271 | 0.237 | 9.437 | 0.029 | 0.028 | 2.386 | 0.040 | 0.042 | 2.348
3.4 Post-optimization for Downstream Tasks
As we always fix the first frame as the world coordinate frame during training, the predicted 4D pointmaps $\hat{X}$ are expected to share the same coordinate system. While $\hat{X}$ itself can already serve as a 4D representation of the scene, this representation also makes it easy to support many geometry tasks with simple and straightforward post-optimization.
Intrinsic Estimation. Since our goal is monocular video reconstruction, it is reasonable to assume that all video frames come from the same camera, i.e., share the same intrinsics $K$. We set the principal point at the center of the frames. This implies:

$c_x = \frac{W}{2}, \quad c_y = \frac{H}{2}$ (7)

Since we fixed the first frame as the coordinate frame during training, $P_1 = I$ holds by definition. According to Eq. 1, we can solve for the remaining focal length $f$ of the intrinsics $K$ by optimizing over the first-frame pointmap $\hat{X}_1$ with the fast Weiszfeld algorithm [101]:

$f^{*} = \arg\min_{f} \sum_{u, v} \left\| (u - c_x,\ v - c_y) - f\, \frac{\big(\hat{X}_1^{x}(u,v),\ \hat{X}_1^{y}(u,v)\big)}{\hat{X}_1^{z}(u,v)} \right\|$ (8)
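Below is a hedged NumPy sketch of Eq. 8: starting from a rough per-pixel estimate, the focal length is refined by Weiszfeld-style iteratively reweighted least squares on the first-frame pointmap, which lies in the first camera's frame since $P_1 = I$. The initialization and iteration count are our own assumptions.

```python
import numpy as np

def estimate_focal_weiszfeld(pointmap0, cx, cy, num_iters=10):
    """Estimate a shared focal length from the first-frame pointmap (Eq. 8)
    via Weiszfeld-style iteratively reweighted least squares."""
    H, W, _ = pointmap0.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u - cx, v - cy], axis=-1).reshape(-1, 2)   # pixels centered at the principal point

    xy = pointmap0[..., :2].reshape(-1, 2)
    z = pointmap0[..., 2].reshape(-1)
    valid = z > 1e-6
    pix, proj = pix[valid], xy[valid] / z[valid, None]         # (X/Z, Y/Z) per pixel

    # Rough initialization, then a few reweighted least-squares refinements.
    f = np.median(np.linalg.norm(pix, axis=1) /
                  np.clip(np.linalg.norm(proj, axis=1), 1e-8, None))
    for _ in range(num_iters):
        residual = np.linalg.norm(pix - f * proj, axis=1)
        w = 1.0 / np.clip(residual, 1e-8, None)                # Weiszfeld reweighting
        f = np.sum(w * np.sum(pix * proj, axis=1)) / np.sum(w * np.sum(proj * proj, axis=1))
    return float(f)
```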
Camera Pose Estimation. After we obtain $K$ and $\hat{X}$ as above, we can easily infer the remaining camera poses $\{\hat{P}_t\}_{t=2}^{N}$ with the RANSAC PnP algorithm [47]:

$\hat{P}_t = \mathrm{PnP}_{\mathrm{RANSAC}}\big(\{\hat{X}_t(u, v) \leftrightarrow (u, v)\},\ K\big)$ (9)
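Eq. 9 maps directly onto OpenCV's RANSAC PnP solver; a possible per-frame sketch (with distortion coefficients set to None and the output converted to a camera-to-world pose) is shown below.

```python
import cv2
import numpy as np

def pose_from_pointmap(pointmap_t, K, valid_mask):
    """Recover the camera-to-world pose of frame t from its predicted pointmap
    (Eq. 9) via RANSAC PnP on pixel <-> world-point correspondences."""
    H, W, _ = pointmap_t.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v], axis=-1).reshape(-1, 2).astype(np.float64)
    pts = pointmap_t.reshape(-1, 3).astype(np.float64)
    keep = valid_mask.reshape(-1)

    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts[keep], pix[keep],
                                           K.astype(np.float64), None)
    if not ok:
        raise RuntimeError("RANSAC PnP failed for this frame")

    R, _ = cv2.Rodrigues(rvec)                   # world-to-camera rotation
    P_w2c = np.eye(4)
    P_w2c[:3, :3], P_w2c[:3, 3] = R, tvec.ravel()
    return np.linalg.inv(P_w2c)                  # camera-to-world pose
```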
Video Depth Estimation. Based on Eq. 1, we can now easily obtain the per-frame depth map estimate $\hat{D}_t$ through simple pinhole projection:

$\hat{D}_t(u, v) = \big[\hat{P}_t^{-1}\, \hat{X}_t(u, v)\big]_{z}$ (10)
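In code, Eq. 10 reduces to transforming each world-frame pointmap back into its camera frame and reading off the z component, e.g.:

```python
import numpy as np

def depth_from_pointmap(pointmap_t, pose_c2w_t):
    """Per-frame depth (Eq. 10): move world-frame points into the camera frame
    of frame t and keep the z component."""
    H, W, _ = pointmap_t.shape
    pts_h = np.concatenate([pointmap_t, np.ones((H, W, 1))], axis=-1)
    cam = pts_h @ np.linalg.inv(pose_c2w_t).T    # world -> camera transform
    return cam[..., 2]                           # depth = z in camera coordinates
```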
4 Implementation Details
4.1 Datasets
We present the training datasets in Tab. 1. For real-world data, we use a subset of the RealEstate10K [180] dataset. As RealEstate10K only has camera pose annotations, we run COLMAP [116] and DepthAnythingV2 [158] to obtain depth maps aligned with the poses and then derive pointmaps. Despite this, the resulting pointmaps remain somewhat noisy, so we filter the subset down to about sequences. Because real-world datasets often yield incomplete or noisy depth and pose estimates, we bias our training heavily toward synthetic datasets, which provide accurate and complete pointmaps for every rendered frame. Specifically, we collect four synthetic datasets covering indoor and outdoor scenes with varying camera and scene motions: DynamicReplica [59], Objaverse [30], PointOdyssey [178], and TartanAir [144]. For Objaverse, we randomly select objects spanning both dynamic and static categories, then render each with a random camera trajectory.
4.2 Model Architecture
We build Sora3R on top of OpenSora [179]. Specifically, we remove the discriminator and adversarial loss from the video VAE. Our VAE compresses the video with a ratio of . For the DiT, we remove the text encoder, text conditioning, and cross-attention. During initialization, we expand the DiT patchify layer by duplicating its weights to double the input channel dimension from 4 to 8, which allows us to tokenize $z_X^{\tau}$ and $z_V$ jointly. We use bfloat16 precision for training and float16 precision for inference. We set the number of sampling steps as at inference.
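The channel expansion of the patchify layer can be sketched as below; we assume the OpenSora patch embedding is a Conv3d-style projection, which may differ from the actual implementation detail.

```python
import torch
import torch.nn as nn

def expand_patchify_in_channels(patch_conv: nn.Conv3d) -> nn.Conv3d:
    """Double the input channels of the DiT patchify projection (4 -> 8) by
    duplicating its pretrained weights, so the pointmap latent and the video
    latent can be tokenized jointly."""
    new_conv = nn.Conv3d(
        in_channels=patch_conv.in_channels * 2,
        out_channels=patch_conv.out_channels,
        kernel_size=patch_conv.kernel_size,
        stride=patch_conv.stride,
        padding=patch_conv.padding,
        bias=patch_conv.bias is not None,
    )
    with torch.no_grad():
        w = patch_conv.weight.data
        new_conv.weight.copy_(torch.cat([w, w], dim=1))  # duplicate along input channels
        if patch_conv.bias is not None:
            new_conv.bias.copy_(patch_conv.bias.data)
    return new_conv
```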
4.3 Training Protocol
We fix the spatiotemporal resolution as for all video clips, with a random temporal stride in , following common practice in video VAEs [164]. Both our XYZ VAE and DiT models are trained in two stages. First stage: we warm up on RealEstate10K for epochs with a learning rate of . Although the RealEstate10K data are large-scale but somewhat noisy, they provide moderate camera motion in static scenes, making them suitable for initial pretraining. Second stage: we fine-tune exclusively on synthetic datasets with a reduced learning rate of . The reason is that synthetic data offer perfect depth and camera poses, ensuring high-quality supervision for the final-stage finetuning. Each fine-tuning stage takes only about hours on Nvidia A100 GPUs.
4.4 Evaluation Protocol
We evaluate Sora3R on three common but challenging datasets unseen during training: Sintel [16], TUM-dynamics [124], and ScanNet [28], which span static and dynamic scenes and cover synthetic and real-world data with diverse camera motions. For each dataset, we sample 50 sequences at uniform intervals, each with a resolution and a temporal stride of 2. Since there are no established metrics to evaluate 4D pointmaps, following MonST3R [170], we conduct camera pose estimation and video depth estimation to evaluate our 4D geometry. For the depth evaluation, we apply the same scale-and-shift alignment used in MonST3R. For the pose evaluation, we apply a Umeyama alignment to match predictions with the ground-truth trajectories. We report the same metrics as [170]: Absolute Relative Error (Abs Rel), percentage of inlier points δ, Absolute Translation Error (ATE), Relative Translation Error (RPE trans), and Relative Rotation Error (RPE rot).
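For reference, a simple least-squares version of the scale-and-shift depth alignment and the Abs Rel metric could look like the sketch below; the exact alignment procedure in MonST3R may differ, so this is only an assumption-laden illustration.

```python
import numpy as np

def scale_shift_align(pred_depth, gt_depth, valid):
    """Least-squares scale-and-shift alignment of predicted depth to groundtruth
    before computing metrics."""
    p, g = pred_depth[valid], gt_depth[valid]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred_depth + t

def abs_rel(pred_depth, gt_depth, valid):
    """Absolute relative error over valid pixels."""
    p, g = pred_depth[valid], gt_depth[valid]
    return float(np.mean(np.abs(p - g) / g))
```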

5 Results
5.1 Camera Pose Estimation
As shown in Tab. 2, our Sora3R demonstrates robust performance across all benchmarks but does not yet surpass the strongest baseline, MonST3R, falling behind by a notable margin. On ScanNet (static), DUSt3R and MonST3R clearly lead the table; for instance, DUSt3R achieves an ATE of 0.019 vs. our 0.040. One reason is that DUSt3R trains on large-scale real-world indoor datasets like ScanNet++ [161], whereas our real-world pretraining only involves noisy depth and poses from RealEstate10K. Despite these gaps in static environments, Sora3R remains competitive for dynamic cases. On Sintel, we outperform DUSt3R on ATE with 0.271 vs. 0.320 and on RPE rot with 9.437 vs. 14.901. We also achieve a comparable ATE on TUM-dynamics (0.029 vs. DUSt3R’s 0.026), although our overall performance there is not as strong as on Sintel, likely due to TUM-dynamics’ smaller human motion within a largely static real-world indoor layout. It is worth noting that both DUSt3R and MonST3R employ global alignment and pairwise graph optimizations, further refining their final pose accuracy.
5.2 Video Depth Estimation
Tab. 3 summarizes our video depth estimation results. ChronoDepth [117] and DepthCrafter [53] have pipelines similar to ours but are dedicated to repurposing VDMs for video depth estimation. Despite relying on PnP-estimated poses from pointmaps (which introduces additional noise), we still outperform ChronoDepth and DepthCrafter in certain settings. For instance, on TUM-dynamics, our method achieves an Abs Rel of 0.211 vs. 0.340 and 0.269, and a δ of 0.742 vs. 0.446 and 0.597, respectively. We also see gains on ScanNet, where our δ is 0.823 vs. 0.641 and 0.688, and our Abs Rel of 0.285 is lower than DepthCrafter’s 0.344. We attribute these improvements to our fine-tuning strategy, in which we adapt the video VAE specifically for understanding 4D geometry instead of relying on an unmodified video-only VAE. Nevertheless, our approach still falls short of DUSt3R [143] and MonST3R [170], especially in static scenes. One likely reason is their use of confidence-map mechanisms that help avoid extreme depth values, such as those in background regions, resulting in smoother depth maps with lower variance (Fig. 3).
Method | Sintel [16] | | TUM-dynamics [124] | | ScanNet (static) [28] |
---|---|---|---|---|---|---
 | Abs Rel | δ | Abs Rel | δ | Abs Rel | δ
ChronoDepth [117] | 0.499 | 0.342 | 0.340 | 0.446 | 0.210 | 0.641 |
DepthCrafter [53] | 1.282 | 0.450 | 0.269 | 0.597 | 0.344 | 0.688 |
DUSt3R [143] | 0.429 | 0.539 | 0.168 | 0.808 | 0.026 | 0.990 |
MonST3R [170] | 0.338 | 0.599 | 0.136 | 0.823 | 0.031 | 0.984 |
Ours (Sora3R) | 0.544 | 0.435 | 0.211 | 0.742 | 0.285 | 0.823
5.3 Qualitative Results
We visualize the recovered depth maps and camera trajectories in Fig. 3. From the visualized depth maps, we find that Sora3R tends to predict depth maps with wider value ranges, which actually align with the groundtruth distribution, while DUSt3R and MonST3R tend to predict smoother depth maps. Regarding depth quality, although DUSt3R and MonST3R show better quality for static structures, our method also demonstrates the ability to reconstruct both scene structure and motion. Regarding pose quality, although the poses recovered from 4D pointmaps by PnP are still noisy, they capture the general camera motion. The visualization verifies that VDMs can indeed develop a sense of the world coordinate frame and understand scene dynamics as well as camera motion.

Method | Input | Pose | | | Video Depth |
---|---|---|---|---|---|---
 | | ATE | RPE trans | RPE rot | Abs Rel | δ
(a) VAERGB | GT Pointmap | 0.042 | 0.036 | 34.223 | 0.584 | 0.641 |
(b) VAERGB→XYZ | GT Pointmap | 0.011 | 0.011 | 1.605 | 0.152 | 0.916 |
(c) VAERGB + DiTRGB→XYZ | Video | 0.267 | 0.099 | 135.739 | 1.407 | 0.116 |
(d) VAESCRATCH→XYZ + DiTSCRATCH→XYZ | Video | 0.299 | 0.112 | 133.342 | 1.279 | 0.144 |
(e) VAERGB→XYZ + DiTRGB→XYZ (Sora3R) | Video | 0.271 | 0.237 | 9.437 | 0.544 | 0.435 |
(f) MonST3R | Video | 0.155 | 0.064 | 1.739 | 0.338 | 0.599 |
5.4 Ablation Study
5.4.1 Importance of 4D Pointmap Learning
We conduct an extensive ablation study on the Sintel dataset in Tab. 4 to analyze the effectiveness of our approach.
In experiment (a), the pretrained video VAERGB, evaluated with ground-truth pointmaps, shows notably weaker performance than experiment (b), where VAERGB→XYZ is fine-tuned on pointmap data. A similar conclusion can be drawn by comparing experiment (c), fine-tuning DiTRGB→XYZ on a frozen video VAERGB, with experiment (e), jointly fine-tuning both VAERGB→XYZ and DiTRGB→XYZ on pointmaps (RPE rot 135.739 vs. 9.437). This considerable gap clearly illustrates that pretrained video VAEs, when kept frozen, struggle with encoding and decoding 4D pointmaps due to their inherent imbalance and scale variance. Our fine-tuning approach effectively bridges this gap, showing that 4D latent domain adaptation is crucial for learning accurate 4D geometry.
Furthermore, comparing experiment (d), where both VAESCRATCH→XYZ and DiTSCRATCH→XYZ are trained from scratch on pointmaps, with (e) Sora3R validates the central motivation behind our approach (Abs Rel 1.279 vs. 0.544): the pretrained video latent representations encode valuable dynamic and geometric priors, and general-purpose video diffusion backbones can reconstruct 4D geometry with proper latent domain adaptation and tuning.
Interestingly, in experiment (b), VAERGB→XYZ, which serves as a theoretical upper bound for our diffusion model, clearly outperforms the strongest baseline MonST3R. We hope future research along this line can further narrow this performance gap.
5.4.2 Inference Efficiency
We compare inference efficiency in Fig. 4. With a spatiotemporal resolution of , we have a model inference time similar to MonST3R. However, our inference time could easily be reduced further by adopting a video VAE with a higher compression ratio or decreasing the number of sampling steps, whereas DUSt3R and MonST3R have to process multiple pairs depending on sequence length. Additionally, since Sora3R predicts 4D pointmaps for all video frames in the world frame at once, we simply perform feedforward post-optimization to obtain pose and depth ( vs. ), while the baselines all need lengthy iterative global alignment and optimization, as their predicted pairwise pointmaps live in different coordinate systems.
5.5 Limitations
The main concern with our method is that its performance still falls short of state-of-the-art 4D geometry reconstruction methods [170]. A common failure case, shown in Fig. 5, occurs when our model predicts imbalanced, wide-ranging pointmaps, which severely affects the robustness of camera pose estimation. Another limitation is that the current model infers a fixed number of frames within a temporal window. However, we believe that by introducing proper regularization and scaling up training data with stronger video VAE and diffusion backbones, which can now generate minute-long videos [146, 12], reconstruction in a 4D latent space with rich spatiotemporal priors can be an efficient and promising direction to scale up 4D reconstruction for longer videos.

6 Conclusion
In this paper, we propose Sora3R, a novel framework that repurposes video diffusion models (VDMs) for 4D geometry reconstruction from monocular video. By fine-tuning a specialized pointmap VAE and a transformer-based diffusion backbone, our method bridges the gap between pointmap regression and generative video modeling, effectively adapting the rich spatiotemporal latent priors for 4D reconstruction. Although our current implementation exhibits a performance gap to state-of-the-art methods, through extensive experiments, we demonstrate that VDMs possess the “world knowledge” to capture physical dynamics and reconstruct dynamic 3D scenes in 4D pointmap latent space. This demonstrates the potential of a generative model-based pipeline capable of handling both static and dynamic reconstruction without relying on external modules or sophisticated optimization. We hope that our work not only validates the potential of repurposing VDMs for 4D reconstruction but also inspires and catalyzes further research in dynamic scene reconstruction.
Acknowledgement. This publication was supported by funding from the KAUST Center of Excellence on GenAI under award number 5940, as well as the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence. For computer time, this research used Ibex, managed by the KAUST Supercomputing Core Laboratory.
References
- Authors [2024] Genesis Authors. Genesis: A universal and generative physics engine for robotics and beyond, 2024.
- Badki et al. [2025] Abhishek Badki, Hang Su, Bowen Wen, and Orazio Gallo. L4p: Low-level 4d vision perception unified. arXiv preprint arXiv:2502.13078, 2025.
- Bahmani et al. [2024a] Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, et al. Tc4d: Trajectory-conditioned text-to-4d generation. In European Conference on Computer Vision, pages 53–72. Springer, 2024a.
- Bahmani et al. [2024b] Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024b.
- Bahmani et al. [2025] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. Proc. CVPR, 2025.
- Bai et al. [2024] Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints. arXiv preprint arXiv:2412.07760, 2024.
- Bescos et al. [2018] Berta Bescos, José M Fácil, Javier Civera, and José Neira. Dynaslam: Tracking, mapping, and inpainting in dynamic scenes. IEEE robotics and automation letters, 3(4):4076–4083, 2018.
- Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
- Bond et al. [2025] Andrew Bond, Jui-Hsien Wang, Long Mai, Erkut Erdem, and Aykut Erdem. Gaussianvideo: Efficient video representation via hierarchical gaussian splatting. arXiv preprint arXiv:2501.04782, 2025.
- Boss et al. [2024] Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani. Sf3d: Stable fast 3d mesh reconstruction with uv-unwrapping and illumination disentanglement. arXiv preprint, 2024.
- Brachmann et al. [2024] Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In European Conference on Computer Vision, pages 421–440. Springer, 2024.
- Brooks et al. [2024a] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024a.
- Brooks et al. [2024b] Tim Brooks, Bill Peebles, Connor Homes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators, 2024b.
- Bruce et al. [2024] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
- Burgert et al. [2025] Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. arXiv preprint arXiv:2501.08331, 2025.
- Butler et al. [2012] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12, pages 611–625. Springer, 2012.
- Cabon et al. [2025] Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. Must3r: Multi-view network for stereo 3d reconstruction, 2025.
- Cadena et al. [2016] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on robotics, 32(6):1309–1332, 2016.
- Campos et al. [2021] Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics, 37(6):1874–1890, 2021.
- Carreira et al. [2024] João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, et al. Scaling 4d representations. arXiv preprint arXiv:2412.15212, 2024.
- Charatan et al. [2024] David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024.
- Chen et al. [2024a] Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, and Juan-Manuel Perez-Rua. Gentron: Diffusion transformers for image and video generation. In CVPR, 2024a.
- Chen et al. [2025a] Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, et al. Goku: Flow based video generative foundation models. arXiv preprint arXiv:2502.04896, 2025a.
- Chen et al. [2024b] Weirong Chen, Le Chen, Rui Wang, and Marc Pollefeys. Leap-vo: Long-term effective any point tracking for visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19844–19853, 2024b.
- Chen et al. [2025b] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025b.
- Chen et al. [2024c] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models are effective 3d generators, 2024c.
- Cong et al. [2024] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: optical flow-guided attention for consistent text-to-video editing. In ICLR, 2024.
- Dai et al. [2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
- Dai et al. [2023] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
- Deitke et al. [2023] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.
- Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NIPS, 2021.
- Dong et al. [2024] Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. arXiv preprint arXiv:2412.08376, 2024.
- Duisterhof et al. [2024] Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. arXiv preprint arXiv:2409.19152, 2024.
- Engel et al. [2023] Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasad Somasundaram, Gustavo Solaira, Harry Lanaras, Henry Howard-Jenkins, Huixuan Tang, Hyo Jin Kim, Jaime Rivera, Ji Luo, Jing Dong, Julian Straub, Kevin Bailey, Kevin Eckenhoff, Lingni Ma, Luis Pesqueira, Mark Schwesinger, Maurizio Monge, Nan Yang, Nick Charron, Nikhil Raina, Omkar Parkhi, Peter Borschowa, Pierre Moulon, Prince Gupta, Raul Mur-Artal, Robbie Pennington, Sachin Kulkarni, Sagar Miglani, Santosh Gondi, Saransh Solanki, Sean Diener, Shangyi Cheng, Simon Green, Steve Saarinen, Suvam Patra, Tassos Mourikis, Thomas Whelan, Tripti Singh, Vasileios Balntas, Vijay Baiyya, Wilson Dreewes, Xiaqing Pan, Yang Lou, Yipu Zhao, Yusuf Mansour, Yuyang Zou, Zhaoyang Lv, Zijian Wang, Mingfei Yan, Carl Ren, Renzo De Nardi, and Richard Newcombe. Project aria: A new tool for egocentric multi-modal ai research, 2023.
- Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
- Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023.
- Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- Fan et al. [2024a] Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. arXiv preprint arXiv:2410.13863, 2024a.
- Fan et al. [2024b] Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, and Yue Wang. Large spatial model: End-to-end unposed images to semantic 3d, 2024b.
- Fu et al. [2024] Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. arXiv preprint arXiv:2412.07759, 2024.
- Gao et al. [2024a] Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024a.
- Gao et al. [2024b] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024b.
- Geng et al. [2024] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, et al. Motion prompting: Controlling video generation with motion trajectories. arXiv preprint arXiv:2412.02700, 2024.
- Girdhar et al. [2023] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
- Goli et al. [2024] Lily Goli, Sara Sabour, Mark Matthews, Brubaker Marcus, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, and Andrea Tagliasacchi. RoMo: Robust motion segmentation improves structure from motion. arXiv:2411.18650, 2024.
- Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
- Hartley and Zisserman [2003] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
- Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer, 2009.
- He et al. [2024] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NIPS, 2020.
- Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- Hong et al. [2023] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.
- Hu et al. [2024] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
- Jarzynski [1997] Christopher Jarzynski. Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 1997.
- Jiang et al. [2023] Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360 deg dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848, 2023.
- Jin et al. [2024a] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias, 2024a.
- Jin et al. [2024b] Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. arXiv preprint arXiv:2412.09621, 2024b.
- Junlin Han [2024] Philip Torr Junlin Han, Filippos Kokkinos. Vfusion3d: Learning scalable 3d generative models from video diffusion models. arXiv preprint arXiv:2403.12034, 2024.
- Karaev et al. [2023] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. CVPR, 2023.
- Ke et al. [2024a] Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, and Konrad Schindler. Video depth without video models. arXiv preprint arXiv:2411.19189, 2024a.
- Ke et al. [2024b] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024b.
- Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
- Kingma and Welling [2013] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
- Kondratyuk et al. [2023] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
- Kong et al. [2024] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- Kopf et al. [2021] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation, 2021.
- Leroy et al. [2024] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
- Li et al. [2024a] Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, and Bernard Ghanem. Vivid-zoo: Multi-view video generation with diffusion model, 2024a.
- Li et al. [2024b] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos. arXiv preprint arXiv:2412.04463, 2024b.
- Li et al. [2025] Zhiqi Li, Yiming Chen, and Peidong Liu. Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian-mesh hybrid representation. Advances in Neural Information Processing Systems, 37:21377–21400, 2025.
- Liang et al. [2024a] Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. arXiv preprint arXiv:2412.12091, 2024a.
- Liang et al. [2024b] Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, and Jiahui Huang. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos, 2024b.
- Liang et al. [2024c] Hanwen Liang, Yuyang Yin, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N Plataniotis, Yao Zhao, and Yunchao Wei. Diffusion4d: Fast spatial-temporal consistent 4d generation via video diffusion models. arXiv preprint arXiv:2405.16645, 2024c.
- Lin et al. [2023] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 300–309, 2023.
- Liu et al. [2024a] Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. arXiv preprint arXiv:2408.16767, 2024a.
- Liu et al. [2024b] Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024b.
- Liu et al. [2023a] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885, 2023a.
- Liu et al. [2024c] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024c.
- Liu et al. [2023b] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023b.
- Liu et al. [2022] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
- Liu et al. [2024d] Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. arXiv preprint arXiv:2412.09401, 2024d.
- Long et al. [2023] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
- Lu et al. [2024] Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, and Yuan Liu. Align3r: Aligned monocular depth estimation for dynamic videos. arXiv preprint arXiv:2412.03079, 2024.
- Ma et al. [2024a] Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, and Xinlong Wang. You see it, you got it: Learning 3d creation on pose-free videos at scale. arXiv preprint arXiv:2412.06699, 2024a.
- Ma et al. [2024b] Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation, 2024b.
- Ma et al. [2025] Yueen Ma, Yuzheng Zhuang, Jianye Hao, and Irwin King. 3d-moe: A mixture-of-experts multi-modal llm for 3d vision and pose diffusion via rectified flow. arXiv preprint arXiv:2501.16698, 2025.
- Mai et al. [2024] Jinjie Mai, Wenxuan Zhu, Sara Rojas, Jesus Zarzar, Abdullah Hamdi, Guocheng Qian, Bing Li, Silvio Giancola, and Bernard Ghanem. Tracknerf: Bundle adjusting nerf from sparse and noisy views via feature tracks. In European Conference on Computer Vision, pages 470–489. Springer, 2024.
- Melas-Kyriazi et al. [2024] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation, 2024.
- Menapace et al. [2024] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024.
- Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- Mittal et al. [2023] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.
- Mur-Artal et al. [2015] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
- Murai et al. [2024] Riku Murai, Eric Dexheimer, and Andrew J. Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors, 2024.
- Neal [2001] Radford M Neal. Annealed importance sampling. Statistics and computing, 2001.
- Ngo et al. [2024] Tuan Duc Ngo, Peiye Zhuang, Chuang Gan, Evangelos Kalogerakis, Sergey Tulyakov, Hsin-Ying Lee, and Chaoyang Wang. Delta: Dense efficient long-range 3d tracking for any video. arXiv preprint arXiv:2410.24211, 2024.
- Özyeşil et al. [2017] Onur Özyeşil, Vladislav Voroninski, Ronen Basri, and Amit Singer. A survey of structure from motion*. Acta Numerica, 26:305–364, 2017.
- Park et al. [2024] Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, and Munchurl Kim. Splinegs: Robust motion-adaptive spline for real-time dynamic 3d gaussians from monocular video, 2024.
- Parker-Holder et al. [2024] Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna Mitenkova, Jane Wang, Jeff Clune, Demis Hassabis, Raia Hadsell, Adrian Bolton, Satinder Singh, and Tim Rocktäschel. Genie 2: A large-scale foundation world model. 2024.
- Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023.
- Plastria [2011] Frank Plastria. The weiszfeld algorithm: proof, amendments, and extensions. Foundations of location analysis, pages 357–389, 2011.
- Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
- Popov et al. [2025] Stefan Popov, Amit Raj, Michael Krainin, Yuanzhen Li, William T Freeman, and Michael Rubinstein. Camctrl3d: Single-image scene exploration with precise 3d camera control. arXiv preprint arXiv:2501.06006, 2025.
- Pumarola et al. [2020] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes, 2020.
- Qian et al. [2023] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
- Qiu et al. [2024] Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, and Ziwei Liu. Freetraj: Tuning-free trajectory control in video diffusion models. arXiv preprint arXiv:2406.16863, 2024.
- Qiu et al. [2023] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, and Xiaoguang Han. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. arXiv preprint arXiv:2311.16918, 2023.
- Rajasegaran et al. [2025] Jathushan Rajasegaran, Xinlei Chen, Rulilong Li, Christoph Feichtenhofer, Jitendra Malik, and Shiry Ginosar. Gaussian masked autoencoders. arXiv preprint arXiv:2501.03229, 2025.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Ravishankar et al. [2024] Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, and Jitendra Malik. Scaling properties of diffusion models for perceptual tasks. arXiv preprint arXiv:2411.08034, 2024.
- Ren et al. [2024] Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, and Huan Ling. L4gm: Large 4d gaussian reconstruction model. In Proceedings of Neural Information Processing Systems (NeurIPS), 2024.
- Ren et al. [2025] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- Sargent et al. [2023] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. Zeronvs: Zero-shot 360-degree view synthesis from a single image. arXiv preprint arXiv:2310.17994, 2023.
- Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Shao et al. [2024] Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Vitor Guizilini, Yue Wang, Matteo Poggi, and Yiyi Liao. Learning temporally consistent video depth from video diffusion priors, 2024.
- Shen et al. [2025] Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, and Xinchao Wang. Seeing world dynamics in a nutshell, 2025.
- Shi et al. [2023a] Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model, 2023a.
- Shi et al. [2023b] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512, 2023b.
- Singer et al. [2023] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.
- Smith et al. [2023] Cameron Smith, Yilun Du, Ayush Tewari, and Vincent Sitzmann. Flowcam: Training generalizable 3d radiance fields without camera poses via pixel-aligned scene flow. arXiv preprint arXiv:2306.00180, 2023.
- Smith et al. [2024] Cameron Smith, David Charatan, Ayush Tewari, and Vincent Sitzmann. Flowmap: High-quality camera poses, intrinsics, and depth via gradient descent. arXiv preprint arXiv:2404.15259, 2024.
- Sturm et al. [2012] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 573–580. IEEE, 2012.
- Tang et al. [2024a] Jiaxiang Tang, Zhaoshuo Li, Zekun Hao, Xian Liu, Gang Zeng, Ming-Yu Liu, and Qinsheng Zhang. Edgerunner: Auto-regressive auto-encoder for artistic mesh generation. arXiv preprint arXiv:2409.18114, 2024a.
- Tang et al. [2024b] Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. arXiv preprint arXiv:2412.06974, 2024b.
- Team [2025] Wan Team. Wan: Open and advanced large-scale video generative models. 2025.
- Teed and Deng [2018] Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605, 2018.
- Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
- Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
- Tochilkin et al. [2024] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.
- Tong et al. [2022] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.
- Triggs et al. [2000] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, September 21–22, 1999 Proceedings, pages 298–372. Springer, 2000.
- Truong et al. [2023] Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt, and Federico Tombari. Sparf: Neural radiance fields from sparse and noisy poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4190–4200, 2023.
- Valevski et al. [2024] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.
- Voleti et al. [2024] Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion, 2024.
- Wang and Agapito [2024] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024.
- Wang et al. [2023a] Jianyuan Wang, Christian Rupprecht, and David Novotny. PoseDiffusion: Solving pose estimation via diffusion-aided bundle adjustment. 2023a.
- Wang et al. [2023b] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023b.
- Wang et al. [2024a] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21686–21697, 2024a.
- Wang et al. [2023c] Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. arXiv preprint arXiv:2311.12024, 2023c.
- Wang et al. [2025] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.
- Wang et al. [2024b] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024b.
- Wang et al. [2020] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.
- Wang et al. [2024c] Yihan Wang, Lahav Lipson, and Jia Deng. Sea-raft: Simple, efficient, accurate raft for optical flow. In European Conference on Computer Vision, pages 36–54. Springer, 2024c.
- Wang et al. [2024d] Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models, 2024d.
- Wang et al. [2021] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.
- Wang et al. [2024e] Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, and Xiaohui Zeng. Llama-mesh: Unifying 3d mesh generation with language models, 2024e.
- Wu et al. [2024a] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering, 2024a.
- Wu et al. [2023] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.
- Wu et al. [2024b] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. arXiv preprint arXiv:2411.18613, 2024b.
- Wu et al. [2024c] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21551–21561, 2024c.
- Xiang et al. [2024] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024.
- Xie et al. [2024a] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformer, 2024a.
- Xie et al. [2025] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.
- Xie et al. [2024b] Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. Sv4d: Dynamic 3d content generation with multi-frame and multi-view consistency. arXiv preprint arXiv:2407.17470, 2024b.
- Yang et al. [2025] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928, 2025.
- Yang et al. [2024] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
- Ye et al. [2024a] Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024a.
- Ye et al. [2024b] Weicai Ye, Xinyu Chen, Ruohao Zhan, Di Huang, Xiaoshui Huang, Haoyi Zhu, Hujun Bao, Wanli Ouyang, Tong He, and Guofeng Zhang. Datap-sfm: Dynamic-aware tracking any point for robust structure from motion in the wild, 2024b.
- Yeshwanth et al. [2023] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.
- Yin et al. [2024] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. arXiv preprint arXiv:2412.07772, 2024.
- Yin and Shi [2018] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1983–1992, 2018.
- Yu et al. [2024a] Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, and Lu Jiang. Language model beats diffusion – tokenizer is key to visual generation, 2024a.
- Yu et al. [2024b] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024b.
- Zeng et al. [2024] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. In European Conference on Computer Vision, pages 163–179. Springer, 2024.
- Zhang et al. [2023] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
- Zhang et al. [2024a] Ganlin Zhang, Erik Sandström, Youmin Zhang, Manthan Patel, Luc Van Gool, and Martin R Oswald. Glorie-slam: Globally optimized rgb-only implicit encoding point cloud slam. arXiv preprint arXiv:2403.19549, 2024a.
- Zhang et al. [2024b] Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi-view video diffusion model for 4d generation. arXiv preprint arXiv:2405.20674, 2024b.
- Zhang et al. [2024c] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024c.
- Zhang et al. [2024d] Jason Y Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion. arXiv preprint arXiv:2402.14817, 2024d.
- Zhang et al. [2024e] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. In European Conference on Computer Vision, pages 1–19. Springer, 2024e.
- Zhang et al. [2024f] Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024f.
- Zhang et al. [2024g] Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, and Jiatao Gu. World-consistent video diffusion with explicit 3d modeling. arXiv preprint arXiv:2412.01821, 2024g.
- Zhang et al. [2025] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. arXiv preprint arXiv:2502.12138, 2025.
- Zhang et al. [2022] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T Freeman. Structure and motion from casual videos. In European Conference on Computer Vision, pages 20–37. Springer, 2022.
- Zhao et al. [2022] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. In European conference on computer vision (ECCV), 2022.
- Zheng et al. [2023] Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023.
- Zheng et al. [2024] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. Last accessed: November 15, 2024.
- Zhou et al. [2018] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
- Zhu et al. [2025] Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, and Jiang Bian. Ar4d: Autoregressive 4d generation from monocular videos. arXiv preprint arXiv:2501.01722, 2025.
- Zhu et al. [2022] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Pollefeys. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12786–12796, 2022.