Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

¹Huazhong University of Science & Technology, ²Xiaomi EV

*Equal Contributions. Intern of Xiaomi EV. Project Leader. Corresponding Author.

Abstract

We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.


Framework

Genesis employs a unified pipeline where both video and LiDAR branches operate within a shared latent space. Vision and geometry are directly coupled through a novel cross-modal conditioning mechanism, enabling consistent temporal evolution and geometric alignment across modalities without relying on occupancy or voxel intermediates.

Joint Generation on nuScenes

Clip1

Clip2

Long-term Multi-View Video Generation on nuScenes

Videos are generated conditioned on 5 initial frames from nuScenes.

Day & Night Generation on nuScenes

Trajectory-Controlled Generation on nuScenes

Joint Generation on Private Data

Long-term Generation (40s) on Private Data

Videos are generated conditioned on 17 initial frames from our private data.

Video Generation Results

Video generation comparison on the nuScenes validation set. Green and blue mark the best and second-best values, respectively.
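FVD and FID, the video metrics quoted in the abstract, are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples. The sketch below illustrates the formula only and is not the evaluation code used for Genesis. It assumes diagonal covariances, a simplification under which the matrix square root in the full formula reduces to a per-dimension term, and it takes plain feature tuples as input, whereas the real metrics use embeddings from pretrained networks. The function name `frechet_distance_diag` is hypothetical.

```python
import math
from statistics import fmean, variance


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between two Gaussians fitted to feature sets.

    Assumption: covariances are treated as diagonal, so each dimension
    contributes (mu_a - mu_b)^2 + var_a + var_b - 2 * sqrt(var_a * var_b).
    The full FID/FVD formula uses complete covariance matrices instead.
    Each feature set needs at least two samples of equal dimensionality.
    """
    if len(feats_a) < 2 or len(feats_b) < 2:
        raise ValueError("each feature set needs at least two samples")
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        # Sample variance (n - 1 denominator) for each set.
        var_a, var_b = variance(col_a), variance(col_b)
        total += (mu_a - mu_b) ** 2 + var_a + var_b - 2.0 * math.sqrt(var_a * var_b)
    return total


if __name__ == "__main__":
    real = [(0.0, 1.0), (2.0, 3.0)]
    shifted = [(1.0, 2.0), (3.0, 4.0)]
    print(frechet_distance_diag(real, real))
    print(frechet_distance_diag(real, shifted))
```

Identical feature sets give a distance of zero, and shifting every feature by a constant changes only the mean term, which is why lower values indicate generated statistics closer to the real ones.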

LiDAR Generation Results

LiDAR generation comparison on the nuScenes validation set. Green and blue mark the best and second-best values, respectively. "gt_img" and "gen_img" indicate that ground-truth or generated images, respectively, are used as the BEV condition input.
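The Chamfer score quoted in the abstract measures how closely a generated point cloud matches a reference one. As a minimal sketch of the idea, and not the evaluation code used for Genesis, the function below computes one common variant: the mean nearest-neighbor Euclidean distance in each direction, summed. Conventions differ across papers (squared distances, averaging the two directions, range cropping), and the page does not state which one is used, so the function name `chamfer_distance` and the chosen convention are assumptions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: for each point, take the Euclidean distance to its
    nearest neighbor in the other set, average per direction, then sum the
    two directions. Brute force, O(len(a) * len(b)); fine for small inputs.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(reference, generated))
```

For full LiDAR sweeps with tens of thousands of points, the brute-force search would normally be replaced by a spatial index such as a k-d tree.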

BibTeX

@article{guo2025genesis,
      title={Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency},
      author={Guo, Xiangyu and Wu, Zhanqian and Xiong, Kaixin and Xu, Ziyang and Zhou, Lijun and Xu, Gangwei and Xu, Shaoqing and Sun, Haiyang and Wang, Bing and Chen, Guang and others},
      journal={arXiv preprint arXiv:2506.07497},
      year={2025}
}