CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving

Yishen Ji1,2   Ziyue Zhu2,3   Zhenxin Zhu2   Kaixin Xiong2   Ming Lu4   Zhiqi Li1
Lijun Zhou2,†   Haiyang Sun2,†   Bing Wang2,✉   Tong Lu1,✉
1Nanjing University 2Xiaomi EV 3Nankai University 4Peking University
jiyishen929@smail.nju.edu.cn
†Project leader. ✉Corresponding author.

Abstract

Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generation models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce a novel spatially adaptive generation framework, CoGen, which leverages advances in 3D generation to improve performance in two key aspects: (i) To ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes. By replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos. (ii) We further introduce a consistency adapter module to strengthen the model's robustness under multi-condition control. The results demonstrate that CoGen excels in preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.

Model

Model Architecture

Overview of our model. (a) Training and inference pipeline. Using BEV maps as conditions, we generate temporal sequences of 3D semantics, which are then projected and encoded to provide guidance for video generation. During projection, a foreground object mask is created and incorporated into training via foreground mask loss reweighting, strengthening supervision on foreground generation quality. (b) Details of 3D semantics projection and encoding. The various forms of guidance are fused through 1 × 1 convolutions. (c) Illustration of our diffusion transformer architecture.
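To make the guidance-fusion step concrete, below is a minimal PyTorch sketch of how several projected guidance maps (e.g., semantic, depth, and coordinate maps) could be merged through 1 × 1 convolutions into a single conditioning feature. The module name, channel counts, and hidden width are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class GuidanceFusion(nn.Module):
    """Hypothetical sketch: fuse projected 3D-semantics guidance maps
    (e.g., semantic map, depth map, coordinate map) with 1x1 convolutions
    before they condition the diffusion transformer."""

    def __init__(self, in_channels, hidden_dim=128):
        super().__init__()
        # one 1x1 conv per guidance modality, mapping it to a shared width
        self.per_modality = nn.ModuleList(
            nn.Conv2d(c, hidden_dim, kernel_size=1) for c in in_channels
        )
        # a final 1x1 conv mixes the concatenated modalities
        self.mix = nn.Conv2d(hidden_dim * len(in_channels), hidden_dim, kernel_size=1)

    def forward(self, guidance_maps):
        # guidance_maps: list of (B, C_i, H, W) tensors, one per modality
        feats = [conv(g) for conv, g in zip(self.per_modality, guidance_maps)]
        return self.mix(torch.cat(feats, dim=1))  # (B, hidden_dim, H, W)


# usage with assumed channel counts: semantic (19), depth (1), coordinate map (3)
fusion = GuidanceFusion(in_channels=[19, 1, 3])
sem, dep, coo = torch.randn(2, 19, 56, 100), torch.randn(2, 1, 56, 100), torch.randn(2, 3, 56, 100)
cond = fusion([sem, dep, coo])  # (2, 128, 56, 100)
```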

Visualization of the generated 3D conditions

Visualization of the 3D semantics conditions used for video generation. Each condition is derived by projecting the 3D semantics grid into the camera view via ray casting, capturing the essential geometric and semantic information that guides video generation.
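The ray-casting projection can be sketched as follows: for each pixel, a ray is marched through the 3D semantics grid and the first occupied voxel supplies the pixel's semantic label and depth. This is a simplified sketch under stated assumptions (fixed step size, dense per-step voxel lookup, hypothetical argument names); the actual projection pipeline may differ.

```python
import torch

def project_semantics_raycast(voxels, K, T_cam2grid, img_hw, voxel_size, grid_origin,
                              max_depth=60.0, step=0.2):
    """Hypothetical sketch: project a 3D semantic voxel grid into a camera view
    by marching rays from each pixel and keeping the first occupied voxel.

    voxels:      (X, Y, Z) long tensor of semantic labels, 0 = empty
    K:           (3, 3) camera intrinsics
    T_cam2grid:  (4, 4) camera-to-grid-frame transform
    grid_origin: (3,) world position of voxel (0, 0, 0)
    Returns a per-pixel semantic map and depth map.
    """
    H, W = img_hw
    device = voxels.device

    # unit ray direction in the camera frame for every pixel
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3)
    dirs = pix @ torch.inverse(K).T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)

    sem_map = torch.zeros(H, W, dtype=torch.long, device=device)
    depth_map = torch.full((H, W), max_depth, device=device)

    # march along each ray until an occupied voxel is hit
    for d in torch.arange(step, max_depth, step, device=device):
        pts_cam = dirs * d
        pts_h = torch.cat([pts_cam, torch.ones(H, W, 1, device=device)], dim=-1)
        pts_grid = (pts_h @ T_cam2grid.T)[..., :3]
        idx = ((pts_grid - grid_origin) / voxel_size).long()        # voxel indices

        valid = ((idx >= 0) & (idx < torch.tensor(voxels.shape, device=device))).all(-1)
        unhit = (sem_map == 0) & valid
        labels = torch.zeros(H, W, dtype=torch.long, device=device)
        labels[unhit] = voxels[idx[unhit, 0], idx[unhit, 1], idx[unhit, 2]]

        hit = unhit & (labels > 0)                                   # first occupied voxel
        sem_map[hit] = labels[hit]
        depth_map[hit] = d
    return sem_map, depth_map
```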

Video Demos

Main Results

Results

Quantitative comparison of video generation quality against other methods. Our method achieves the best FVD score.

Controllability

Comparison with baselines on video generation controllability. Results are computed on the first 16 frames of each video.

Ablation Study

Mask Loss

Qualitative visualization of the effect of the 3D semantics mask loss.
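A minimal sketch of the idea behind the foreground mask loss reweighting, assuming a standard noise-prediction diffusion objective and a hypothetical foreground weight: pixels covered by projected foreground objects are up-weighted in the denoising loss.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred, noise_target, fg_mask, fg_weight=2.0):
    """Hypothetical sketch of foreground mask loss reweighting: pixels covered by
    projected foreground objects (fg_mask == 1) contribute more to the denoising
    objective, strengthening supervision on foreground regions.

    noise_pred, noise_target: (B, C, T, H, W) predicted / target noise
    fg_mask:                  (B, 1, T, H, W) binary foreground mask
    fg_weight:                assumed up-weighting factor (not from the paper)
    """
    per_pixel = F.mse_loss(noise_pred, noise_target, reduction="none")
    weights = 1.0 + (fg_weight - 1.0) * fg_mask   # 1 on background, fg_weight on foreground
    return (per_pixel * weights).mean()
```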

FVD Comparison

Comparison of FVD scores across model settings. Incorporating the adapter consistently lowers FVD for 8-, 16-, 28-, and 40-frame sequences.
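For reference, FVD is the Fréchet distance between Gaussian fits of video features (typically from a pretrained I3D network) for real and generated clips. The sketch below shows that computation in NumPy/SciPy; it is a generic reference implementation, not the evaluation code used for the tables above.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to per-video features.

    feats_real, feats_gen: (N, D) arrays of features extracted from real and
    generated videos (e.g., by an I3D backbone).
    """
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```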

Ablation Settings

'Sem Dep' denotes the semantic and depth maps, while 'MPI Coor' refers to the MPI and coordinate maps. 'Adapter' indicates the consistency adapter, and '3D-Sem' denotes 3D semantics-based guidance (GT for ground truth, GEN for our generated 3D semantics).