Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generative models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, generating controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce CoGen, a novel spatially adaptive generation framework that leverages advances in 3D generation to improve performance in two key aspects: (i) to ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes; by replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos; (ii) we further introduce a consistency adapter module that strengthens the model's robustness under multi-condition control. Experiments demonstrate that our method excels at preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
Overview of our model. (a) Training and inference pipeline. Conditioned on BEV maps, we generate temporal sequences of 3D semantics, which are then projected and encoded to provide guidance for video generation. During projection, a foreground object mask is created and incorporated into training through a foreground mask loss reweighting, strengthening supervision of foreground generation quality. (b) Details of 3D semantics projection and encoding. The different forms of guidance are fused through 1 × 1 convolutions. (c) Illustration of our diffusion transformer architecture.
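To make the fusion step in (b) concrete, the following is a minimal PyTorch sketch of fusing several projected guidance maps through 1 × 1 convolutions: each condition type gets its own 1 × 1 embedding convolution, the embeddings are summed, and a final 1 × 1 convolution mixes them. The channel counts, the set of condition types, and the module name are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class GuidanceFusion(nn.Module):
    """Fuse projected 3D-semantics guidance maps with 1x1 convolutions.

    Sketch: each guidance map (e.g., semantic map, depth map, MPI,
    coordinate map) is embedded by its own 1x1 convolution; the
    embeddings are summed and mixed by a final 1x1 convolution.
    """

    def __init__(self, in_channels=(19, 1, 8, 3), embed_dim=64):
        super().__init__()
        self.embeds = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels
        )
        self.mix = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)

    def forward(self, guidance_maps):
        # guidance_maps: list of (B, C_i, H, W) tensors, one per condition
        fused = sum(embed(g) for embed, g in zip(self.embeds, guidance_maps))
        return self.mix(fused)

# Example: fuse semantic, depth, MPI, and coordinate maps for one frame.
maps = [torch.randn(2, c, 128, 224) for c in (19, 1, 8, 3)]
guidance = GuidanceFusion()(maps)  # (2, 64, 128, 224)
```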
Visualization of the 3D semantic conditions used for video generation. Each condition is derived by projecting the 3D semantics grid into the camera view via ray casting, capturing the geometric and semantic information essential for enhanced video generation.
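As an illustration of this projection, below is a simplified NumPy sketch of ray casting a semantic voxel grid into a camera view: for each pixel, a ray is marched uniformly through the grid and the label and depth of the first occupied voxel are recorded. Function name, grid layout, and all defaults are assumptions; a real implementation would use a proper voxel traversal (e.g., DDA) and z-depth rather than ray distance.

```python
import numpy as np

def project_semantics(voxels, K, cam_to_world, H, W,
                      voxel_size=0.5, grid_origin=(-50.0, -50.0, -5.0),
                      max_depth=60.0, step=0.25):
    """Project a 3D semantic voxel grid into a camera view by ray casting.

    voxels: integer (X, Y, Z) label grid, 0 = empty; K: 3x3 intrinsics;
    cam_to_world: 4x4 camera-to-world pose. Returns a semantic map and a
    (ray-distance) depth map of shape (H, W).
    """
    # Ray direction in world coordinates for every pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)
    dirs = (pix @ np.linalg.inv(K).T) @ cam_to_world[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = cam_to_world[:3, 3]

    sem = np.zeros(H * W, dtype=voxels.dtype)   # semantic map (0 = empty)
    dep = np.full(H * W, max_depth)             # depth map
    alive = np.ones(H * W, dtype=bool)          # rays that have not hit yet

    for t in np.arange(step, max_depth, step):  # coarse uniform ray marching
        pts = origin + t * dirs[alive]
        idx = np.floor((pts - np.asarray(grid_origin)) / voxel_size).astype(int)
        inside = np.all((idx >= 0) & (idx < voxels.shape), axis=-1)
        labels = np.zeros(len(idx), dtype=voxels.dtype)
        labels[inside] = voxels[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
        hit = labels != 0
        ids = np.flatnonzero(alive)[hit]        # first hit along each ray
        sem[ids], dep[ids] = labels[hit], t
        alive[ids] = False
        if not alive.any():
            break
    return sem.reshape(H, W), dep.reshape(H, W)
```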
Quantitative comparison of video generation quality with other methods. Our method achieves the best FVD score.
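For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos (features typically from a pretrained I3D network). The sketch below computes only the distance itself, assuming feature extraction happens elsewhere; the function name is our own.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to (N, D) feature arrays.

    FVD applies this distance to I3D video features; extracting those
    features is assumed to be done separately.
    """
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    return float(
        np.sum((mu_r - mu_g) ** 2)
        + np.trace(cov_r) + np.trace(cov_g) - 2.0 * np.trace(covmean)
    )
```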
Comparison with baselines on video generation controllability. Results are computed on the first 16 frames of each video.
Qualitative visualization of the effect of the 3D semantics mask loss.
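A minimal sketch of the foreground mask loss reweighting described in the overview: the per-pixel diffusion denoising loss is upweighted wherever the projected 3D semantics mark foreground objects. The weighting scheme and the `fg_weight` value are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(pred_noise, target_noise, fg_mask, fg_weight=2.0):
    """Diffusion loss reweighted by a foreground object mask.

    pred_noise/target_noise: (B, C, H, W); fg_mask: (B, 1, H, W) binary
    mask obtained from the 3D-semantics projection.
    """
    per_pixel = F.mse_loss(pred_noise, target_noise, reduction="none")
    weights = 1.0 + (fg_weight - 1.0) * fg_mask  # >1 on foreground pixels
    return (weights * per_pixel).mean()
```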
Comparison of FVD scores across different model settings. Incorporating the adapter consistently lowers FVD across 8-, 16-, 28-, and 40-frame sequences.
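Since the adapter's internals are not spelled out here, the following is a generic residual-adapter sketch of how a consistency adapter might inject encoded 3D-semantics guidance into transformer tokens: the output projection is zero-initialized so training starts from an identity mapping. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConsistencyAdapter(nn.Module):
    """Lightweight adapter injecting condition features into backbone tokens."""

    def __init__(self, hidden_dim=1024, cond_dim=64, bottleneck=256):
        super().__init__()
        self.down = nn.Linear(hidden_dim + cond_dim, bottleneck)
        self.act = nn.SiLU()
        self.up = nn.Linear(bottleneck, hidden_dim)
        nn.init.zeros_(self.up.weight)  # zero-init: residual is 0 at start
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden, cond):
        # hidden: (B, N, hidden_dim) transformer tokens
        # cond:   (B, N, cond_dim) encoded 3D-semantics guidance tokens
        residual = self.up(self.act(self.down(torch.cat([hidden, cond], -1))))
        return hidden + residual
```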
'Sem Dep' denotes the semantic map and depth map, while 'MPI Coor' refers to the MPI and coordinate map. 'Adapter' indicates the consistency adapter, and '3D-Sem' represents 3D semantics-based guidance (GT for ground truth, GEN for our generated 3D semantics).