DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes Using Unposed Images

Xiaoxue Chen¹,²*, Ziyi Xiong¹,²*, Yuantao Chen¹, Gen Li¹, Nan Wang¹,
Hongcheng Luo², Long Chen², Haiyang Sun²†, Bing Wang², Guang Chen², Hangjun Ye²,✉,
Hongyang Li³, Ya-Qin Zhang¹, Hao Zhao¹,⁴,✉
¹ AIR, Tsinghua University    ² Xiaomi EV    ³ The University of Hong Kong    ⁴ Beijing Academy of Artificial Intelligence
* These authors contributed equally    † Project leader

DGGT Introduction Video

Abstract

Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and note that existing formulations, which treat camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.

Method


DGGT reconstructs temporally coherent 3D scenes from unposed image sequences in a single feed-forward pass. Our core architecture is a transformer-based network that jointly predicts per-frame camera parameters and a pixel-aligned 3D Gaussian field. Each Gaussian encodes appearance and geometry — color, 3D position, rotation, scale, and opacity — together with a learned lifespan that controls its temporal visibility. To model dynamics, a dedicated motion head fuses 2D image features with the 3D Gaussian points to form a spatio-temporal feature cloud and predicts consistent 3D trajectories for moving objects. The geometry and motion components are trained end-to-end to produce a coherent 4D representation. Finally, a separately trained diffusion-based rendering module refines composed renders to remove artifacts and produce high-fidelity, photorealistic outputs.
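For intuition, the attribute list above maps naturally onto a small per-pixel prediction head. The PyTorch sketch below is our own illustration under assumed feature shapes, activations, and a two-parameter lifespan encoding; it is not the released DGGT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelGaussianHead(nn.Module):
    """Illustrative pixel-aligned Gaussian head (assumed design, not the released DGGT code).

    Maps per-pixel transformer features to the attributes named in the text:
    color, 3D position, rotation, scale, opacity, and a temporal lifespan.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # 3 color + 3 position + 4 quaternion + 3 scale + 1 opacity + 2 lifespan = 16 channels
        self.proj = nn.Conv2d(feat_dim, 16, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (B, C, H, W) per-frame features from the shared transformer backbone
        color, pos, rot, scale, opac, life = torch.split(
            self.proj(feats), [3, 3, 4, 3, 1, 2], dim=1
        )
        return {
            "color": torch.sigmoid(color),        # RGB in [0, 1]
            "position": pos,                      # 3D position, e.g. depth along the pixel ray plus an offset
            "rotation": F.normalize(rot, dim=1),  # unit quaternion
            "scale": torch.exp(scale),            # strictly positive scales
            "opacity": torch.sigmoid(opac),       # base opacity in [0, 1]
            "lifespan": life,                     # assumed (center, extent) of temporal visibility
        }
```

A single 1x1 convolution split into attribute groups keeps such a head lightweight, so every pixel of every input frame contributes one Gaussian in a single forward pass.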

Results

Comparison on Waymo Open Dataset

We mainly evaluate our approach on the Waymo Open Dataset, focusing on reconstructed-image quality (PSNR, SSIM, LPIPS) and aligned depth accuracy (aligned RMSE / D-RMSE), and compare directly to recent feed-forward and spatio-temporal baselines including STORM, NoPoSplat (NoPo), and DepthSplat (Depth). On the same scenes, our method yields consistently higher-fidelity, higher-resolution renderings, more accurate and temporally coherent depth maps, and improved reconstruction of dynamic objects with reduced ghosting and disocclusions.
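For reference, the image-quality metrics can be reproduced with standard libraries. The sketch below (using the skimage and lpips packages) illustrates the protocol and is not the authors' evaluation script.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def image_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float images in [0, 1] with shape (H, W, 3)."""
    def to_tensor(x):
        # lpips expects NCHW tensors scaled to [-1, 1]
        return torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2.0 - 1.0

    return {
        "PSNR": peak_signal_noise_ratio(gt, pred, data_range=1.0),
        "SSIM": structural_similarity(gt, pred, channel_axis=-1, data_range=1.0),
        "LPIPS": lpips_fn(to_tensor(pred), to_tensor(gt)).item(),
    }
```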

Scene Reconstruction

Scene Reconstruction: Photorealistic, high-resolution renderings from Waymo frames that preserve fine texture and temporal coherence.



Depth Estimation

Depth Estimation: We predict relative depth (scale and offset are not fixed) and report errors after linear alignment to the ground truth.
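One common realization of such linear alignment is a least-squares fit of a single scale and offset against the ground truth before computing RMSE. The sketch below follows that convention and may differ in detail from the exact protocol used for the reported numbers.

```python
import numpy as np

def aligned_depth_rmse(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """RMSE after least-squares fitting a single scale and offset on valid pixels.

    pred, gt: (H, W) depth maps; mask: boolean map of pixels with valid ground truth.
    """
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)                # design matrix [p, 1]
    (scale, offset), *_ = np.linalg.lstsq(A, g, rcond=None)   # g ≈ scale * p + offset
    return float(np.sqrt(np.mean((scale * p + offset - g) ** 2)))
```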



Dynamic Partitioning

Dynamic Partitioning: Moving objects are separated via learned lifespans and per-pixel 3D motion, substantially reducing ghosting and temporal artifacts.
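For concreteness, one way lifespan gating and per-point motion could enter the renderer is sketched below; the soft sigmoid window and constant-velocity displacement are illustrative assumptions, not the exact parameterization used by the model.

```python
import torch

def opacity_at_time(base_opacity, life_center, life_extent, t, sharpness=10.0):
    """Gate each Gaussian's opacity by its lifespan (assumed soft-window form).

    A Gaussian keeps its base opacity while |t - center| < extent and fades
    out smoothly otherwise, so the gate remains differentiable during training.
    """
    gate = torch.sigmoid(sharpness * (life_extent - (t - life_center).abs()))
    return base_opacity * gate

def positions_at_time(xyz, velocity, t, t_ref=0.0):
    """Displace dynamic Gaussians with a per-point 3D motion (constant-velocity sketch)."""
    return xyz + velocity * (t - t_ref)
```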


Scene Editing

Scene Editing: Our scene representation is built from composable 3D Gaussian primitives, which enables direct, geometry-aware edits at the primitive level. Users can delete individual people or objects, apply rigid transforms (translation and rotation) to selected primitives, or transplant dynamic primitives from other scenes into the current scene, all without retraining the model. A diffusion-based refinement step then inpaints disocclusions and harmonizes appearance, while the model's lifespan and per-pixel motion predictions preserve temporal coherence and correct 3D placement of edited elements.
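The sketch below shows what such primitive-level edits can look like on a dict-of-tensors Gaussian representation; the field names and layout are assumptions for illustration, not the released editing API.

```python
import torch

def delete_primitives(gaussians: dict, keep: torch.Tensor) -> dict:
    """Drop selected Gaussians (e.g. a person or vehicle) via a boolean mask over primitives."""
    return {k: v[keep] for k, v in gaussians.items()}

def rigid_transform(gaussians: dict, select: torch.Tensor, R: torch.Tensor, t: torch.Tensor) -> dict:
    """Apply a rigid motion (rotation R, translation t) to the selected primitives."""
    out = {k: v.clone() for k, v in gaussians.items()}
    out["position"][select] = gaussians["position"][select] @ R.T + t
    # A complete edit would also compose R with each selected rotation quaternion; omitted for brevity.
    return out

def transplant(target: dict, source: dict, select: torch.Tensor) -> dict:
    """Insert dynamic primitives from another scene by concatenating per-attribute tensors."""
    return {k: torch.cat([target[k], source[k][select]], dim=0) for k in target}
```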

Remove human
Character movement

Remove vehicle
Vehicle movement

More Results on Argoverse2

We also evaluate our approach on Argoverse2.

More Results on nuScenes

We also evaluate our approach on nuScenes.