ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

1Huazhong University of Science and Technology, 2Xiaomi EV

*Equal Contributions. Project Lead. Corresponding Author.

Abstract

Although end-to-end autonomous driving has made remarkable progress, its performance degrades significantly in rare and long-tail scenarios. Recent approaches attempt to address this challenge by leveraging the rich world knowledge of Vision-Language Models (VLMs), but these methods suffer from several limitations: (1) a significant domain gap between the pre-training data of VLMs and real-world driving data, (2) a dimensionality mismatch between the discrete language space and the continuous action space, and (3) a tendency of imitation learning to capture the average behavior in the dataset, which may be suboptimal or even dangerous. In this paper, we propose ReCogDrive, an autonomous driving system that integrates VLMs with a diffusion planner and adopts a three-stage training paradigm. In the first stage, we use a large-scale driving question-answering dataset to train the VLM, mitigating the domain discrepancy between generic content and real-world driving scenarios. In the second stage, we employ a diffusion-based planner to perform imitation learning, mapping representations from the latent language space to continuous driving actions. Finally, we fine-tune the diffusion planner with reinforcement learning in the NAVSIM non-reactive simulator, enabling the model to generate safer, more human-like driving trajectories. We evaluate our approach on the planning-oriented NAVSIM benchmark, achieving a PDMS of 89.6 and setting a new state of the art that surpasses the previous vision-only SOTA by 5.6 PDMS.


Model Architecture and Training Pipeline

ReCogDrive System Overview

Overview of the ReCogDrive architecture and training pipeline. The system consists of a Vision-Language Model (VLM) and a diffusion planner. It takes as input a front-view image, navigation command, ego states, and task instruction. The VLM encodes multimodal information into latent features, which are passed to the diffusion planner to generate future trajectories by denoising from random noise. Training follows a three-stage paradigm: (1) the VLM is pre-trained on a large-scale driving QA dataset to adapt it to driving scenarios; (2) the VLM is then frozen, and the diffusion planner is trained via imitation learning to mimic expert driving behaviors; (3) the planner is further fine-tuned using reinforcement learning with the assistance of the NAVSIM simulator, enabling it to predict safer, more stable, and more comfortable trajectories.
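
To make the data flow concrete, below is a minimal PyTorch sketch of the planner's inference loop. It is not the paper's implementation: the class name DiffusionPlanner, all layer sizes, the VLM feature dimension, and the plain iterative refinement schedule are illustrative assumptions. It only shows how latent VLM features condition a denoiser that progressively refines a trajectory initialized from Gaussian noise.

import torch
import torch.nn as nn

class DiffusionPlanner(nn.Module):
    """Illustrative trajectory diffusion head conditioned on VLM latents.
    Sizes (HORIZON, D_MODEL, N_STEPS, VLM_DIM) are assumptions, not the paper's."""
    HORIZON, D_MODEL, N_STEPS, VLM_DIM = 8, 256, 10, 1024

    def __init__(self):
        super().__init__()
        self.traj_embed = nn.Linear(2, self.D_MODEL)             # (x, y) waypoint -> token
        self.cond_proj = nn.Linear(self.VLM_DIM, self.D_MODEL)   # VLM latent -> memory
        layer = nn.TransformerDecoderLayer(self.D_MODEL, nhead=8, batch_first=True)
        self.denoiser = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(self.D_MODEL, 2)                   # token -> refined (x, y)

    @torch.no_grad()
    def sample(self, vlm_feats: torch.Tensor) -> torch.Tensor:
        """vlm_feats: (B, T, VLM_DIM) latent features from the frozen VLM."""
        cond = self.cond_proj(vlm_feats)
        traj = torch.randn(vlm_feats.size(0), self.HORIZON, 2)   # start from pure noise
        for _ in range(self.N_STEPS):                            # iterative denoising
            tokens = self.denoiser(tgt=self.traj_embed(traj), memory=cond)
            traj = self.head(tokens)                             # predict a cleaner trajectory
        return traj                                              # (B, HORIZON, 2) ego-frame waypoints

planner = DiffusionPlanner()
waypoints = planner.sample(torch.randn(1, 32, 1024))             # e.g. 32 VLM tokens

A full diffusion sampler would also inject scheduled noise between refinement steps; the loop above is kept deliberately simple to highlight the conditioning path from VLM latents to waypoints.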

Simulator-assisted Reinforcement Learning.

Comparison of Imitation Learning and Simulator-assisted Reinforcement Learning

Imitation learning often struggles with the inherent diversity of expert demonstrations, which can lead to averaged and suboptimal trajectories, as illustrated in (a). ReCogDrive addresses this limitation with simulator-assisted reinforcement learning (RL), enabling the diffusion planner to explore and learn robust driving behaviors in a simulated environment. In (b), multiple trajectories are sampled from the diffusion planner within the non-reactive NAVSIM simulator and evaluated on safety, drivability, and comfort to compute the Predictive Driver Model Score (PDMS), which serves as the reward. The rewards are then compared within each sampled group to derive advantages, which in turn drive the policy loss. To ensure stable learning, we combine the RL objective with a behavior cloning loss. This enables ReCogDrive to predict safer, smoother, and more reliable trajectories beyond imitation.
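
The sketch below illustrates one such training step in Python under clearly labeled assumptions: sample_with_logprob, denoising_loss, and simulator.score are hypothetical interfaces standing in for the planner's sampler, its imitation objective, and the NAVSIM PDMS evaluation, and the group-relative advantage normalization shown is one common way to realize the group computation described above. The 5/5/2 sub-score weighting follows the published NAVSIM PDM score definition.

import torch

def pdm_score(nc, dac, ttc, ep, comfort):
    """NAVSIM PDM score: hard penalties (no-collision NC, drivable-area compliance
    DAC) multiply a weighted mean of time-to-collision (TTC), ego progress (EP),
    and comfort sub-scores, all in [0, 1]."""
    return nc * dac * (5.0 * ttc + 5.0 * ep + 2.0 * comfort) / 12.0

def rl_loss(planner, vlm_feats, simulator, expert_traj, group_size=8, beta=1.0):
    """One simulator-assisted RL step (group-relative sketch, hypothetical APIs)."""
    # Sample a group of trajectories; each comes with its (tensor) log-probability.
    trajs, logps = zip(*(planner.sample_with_logprob(vlm_feats)
                         for _ in range(group_size)))
    rewards = torch.tensor([simulator.score(t) for t in trajs])  # PDMS per sample
    # Group-relative advantages: center and scale rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    policy_loss = -(torch.stack(logps) * adv).mean()
    # Behavior-cloning term anchors the policy to expert demonstrations,
    # stabilizing the RL fine-tuning.
    bc_loss = planner.denoising_loss(vlm_feats, expert_traj)
    return policy_loss + beta * bc_loss

Because the advantage baseline is the group's own mean reward, this style of update needs no learned value function, which keeps the fine-tuning stage lightweight.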

Performance comparison on NAVSIM.

Performance comparison on NAVSIM benchmark

The figure above presents the performance comparison on the NAVSIM benchmark. ReCogDrive achieves a Predictive Driver Model Score (PDMS) of 89.6, setting a new state of the art. Despite relying solely on camera inputs, it surpasses LiDAR-augmented models such as DiffusionDrive and WoTE by 1.5 and 1.2 PDMS, respectively. Compared to fine-tuned VLM baselines such as InternVL3 and Qwen2.5-VL, ReCogDrive delivers a significant improvement of 6.3 PDMS, demonstrating the effectiveness of our three-stage training framework. It also outperforms the previous best camera-only method, PARA-Drive, by 5.6 PDMS.

Visualization.

ReCogDrive Perception and Planning Visualization Example

This visualization showcases ReCogDrive's combined perception and planning capabilities. As shown in the figure, in addition to generating precise and smooth trajectory predictions, the system produces accurate scene summaries and clear, high-level driving instructions. ReCogDrive correctly identifies critical objects such as taxis and traffic lights and integrates this cognitive understanding into its planning decisions, enabling end-to-end autonomous driving with enhanced cognition.

BibTeX

@article{li2025recogdrive,
  title={ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving},
  author={Li, Yongkang and Xiong, Kaixin and Guo, Xiangyu and Li, Fang and Yan, Sixu and Xu, Gangwei and Zhou, Lijun and Chen, Long and Sun, Haiyang and Wang, Bing and others},
  journal={arXiv preprint arXiv:2506.08052},
  year={2025}
}