DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction

Zhiyi Hou1,2,3,*, Enhui Ma1,3,*, Fang Li2,*, Zhiyi Lai2, Kalok Ho2, Zhanqian Wu2, Lijun Zhou2, Long Chen2, Chitian Sun2, Haiyang Sun2,†, Bing Wang2, Guang Chen2, Hangjun Ye2, Kaicheng Yu1,✉
1Westlake University, 2Xiaomi EV, 3Zhejiang University
*Equal Contribution †Project Leader ✉Corresponding author

Method Overview

Method Overview Banner

We enhance vision-language models (VLMs) for motion risk prediction by synthesizing high-risk motion data via a Bird’s-Eye View (BEV)-based motion simulation and introducing a VLM-agnostic framework with a projection-based visual prompting scheme to address the modality gap.

Abstract

Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle’s future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLMs) by synthesizing high-risk motion data. Specifically, we introduce a Bird’s-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that fine-tuning with DriveMRP-10K allows our DriveMRP-Agent framework to significantly improve the motion risk prediction performance of multiple VLM baselines, raising accident recognition accuracy from 27.13% to 88.03%. Moreover, in zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting accuracy from the base model’s 29.42% to 68.50%, which demonstrates the strong generalization of our method to real-world scenarios.

Method

1 DriveMRP-10K

DriveMRP-10K Dataset Generation Pipeline

A synthetic dataset of high-risk driving motions built on nuPlan. BEV-based simulation models risks from the ego-vehicle, other agents, and the environment. The pipeline combines trajectory generation, human-in-the-loop labeling, and GPT-4o captioning, yielding 10K multimodal samples for VLM training.
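
The exact simulation rules live in the paper rather than on this page; below is a minimal Python sketch of one way an emergency-braking variant could be synthesized from a ground-truth BEV trajectory. The function name, the constant-deceleration model, and all parameter values are illustrative assumptions, not the released pipeline.

import numpy as np

def synthesize_hard_brake(waypoints: np.ndarray, dt: float = 0.5,
                          brake_decel: float = -6.0, brake_start: int = 4) -> np.ndarray:
    """Perturb a ground-truth ego trajectory (N x 2, BEV meters) into an
    emergency-braking variant by re-integrating speeds under a hard deceleration.
    (Hypothetical sketch; not the DriveMRP-10K generation code.)"""
    # Recover per-step speeds and unit headings from the original waypoints.
    deltas = np.diff(waypoints, axis=0)
    speeds = np.linalg.norm(deltas, axis=1) / dt
    headings = deltas / np.maximum(np.linalg.norm(deltas, axis=1, keepdims=True), 1e-6)

    # Apply a constant hard deceleration from `brake_start` onward.
    for i in range(brake_start, len(speeds)):
        speeds[i] = max(speeds[i - 1] + brake_decel * dt, 0.0)

    # Re-integrate positions along the original headings.
    out = [waypoints[0]]
    for v, h in zip(speeds, headings):
        out.append(out[-1] + h * v * dt)
    return np.stack(out)

Because the perturbed trajectory stays in BEV coordinates, such variants remain plug-and-play: they can be projected and captioned with the same downstream tooling as real motions.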

2 DriveMRP-Agent

DriveMRP-Agent Framework Architecture

A VLM-agnostic framework, instantiated here on Qwen2.5-VL-7B. It employs projection-based visual prompting to bridge the modality gap between numerical coordinates and images, and combines BEV and front-view contexts to enable chain-of-thought reasoning for motion risk prediction.
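
As a concrete illustration of projection-based visual prompting, the sketch below projects ego-frame waypoints into the front-view image and draws them as markers, assuming a pinhole camera with intrinsics K and an ego-to-camera extrinsic T_cam_from_ego. The actual prompt rendering used by DriveMRP-Agent may differ.

import numpy as np
import cv2

def project_waypoints(img: np.ndarray, waypoints_ego: np.ndarray,
                      K: np.ndarray, T_cam_from_ego: np.ndarray) -> np.ndarray:
    """Draw BEV waypoints (N x 3, ego frame, meters) onto the front-view image,
    turning numeric coordinates into a visual prompt the VLM can attend to.
    (Hypothetical sketch of the projection step.)"""
    pts = np.hstack([waypoints_ego, np.ones((len(waypoints_ego), 1))])  # homogeneous
    cam = (T_cam_from_ego @ pts.T).T[:, :3]          # ego frame -> camera frame
    cam = cam[cam[:, 2] > 0.5]                       # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide
    for u, v in uv:
        if 0 <= u < img.shape[1] and 0 <= v < img.shape[0]:
            cv2.circle(img, (int(u), int(v)), 6, (0, 0, 255), -1)  # red dot per waypoint
    return img

Rendering waypoints into pixels, rather than passing raw coordinates as text, lets any off-the-shelf VLM reason about trajectory-environment spatial relations, which is what makes the framework VLM-agnostic.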

Risk Scenarios in DriveMRP-10K

Emergency acceleration scenario

Emergency braking scenario

Collision scenario

Illegal lane change scenario
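
This page does not spell out how scenario labels are assigned; as a hedged sketch, simple kinematic heuristics could separate the four illustrated types before the human-in-the-loop verification described above. The thresholds, the lane-offset input, and the single-label output are all hypothetical.

import numpy as np

def label_risk(waypoints: np.ndarray, others: list[np.ndarray],
               lane_offsets: np.ndarray, dt: float = 0.5) -> str:
    """Assign one of the illustrated risk labels to a BEV trajectory
    using simple kinematic heuristics. (Illustrative thresholds only.)"""
    speeds = np.linalg.norm(np.diff(waypoints, axis=0), axis=1) / dt
    accel = np.diff(speeds) / dt

    # Collision: another agent's trajectory comes within ~2 m of ours.
    for traj in others:
        n = min(len(traj), len(waypoints))
        if np.min(np.linalg.norm(traj[:n] - waypoints[:n], axis=1)) < 2.0:
            return "collision"
    if np.max(accel, initial=0.0) > 4.0:
        return "emergency acceleration"
    if np.min(accel, initial=0.0) < -4.0:
        return "emergency braking"
    # Illegal lane change: lateral offset from the lane center exceeds ~half a lane.
    if np.max(np.abs(lane_offsets)) > 1.75:
        return "illegal lane change"
    return "no risk"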

DriveMRP-Agent Inference Results

Illegal Lane Change Risk Case

Case 1: Illegal Lane Change Risk

Comparison of risk predictions across models for an illegal lane change scenario. DriveMRP accurately identifies the risk (ground truth: illegal lane change), while baseline models (Qwen2.5-VL-7B, InternVL2.5-8B) misclassify it as "no risk".

Abnormal Deceleration Risk Case

Case 2: Abnormal Deceleration Risk

Model performance on an abnormal deceleration scenario. DriveMRP detects the risk from the color changes along the projected trajectory, which encode speed (ground truth: abnormal deceleration), while the baselines fail to recognize the sudden speed drop.
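
The case suggests the projected trajectory is color-coded by speed; below is a minimal sketch of such a speed-to-color mapping, assuming an OpenCV-style BGR canvas and a matplotlib colormap. The actual encoding used by DriveMRP is not specified on this page.

import numpy as np
from matplotlib import cm

def speed_colors(speeds: np.ndarray, v_max: float = 20.0) -> np.ndarray:
    """Map per-waypoint speeds (m/s) to BGR colors, so that a sudden speed drop
    shows up as an abrupt color change along the drawn trajectory.
    (Hypothetical sketch; colormap choice is an assumption.)"""
    norm = np.clip(speeds / v_max, 0.0, 1.0)
    rgba = cm.turbo(norm)                                 # slow -> blue, fast -> red
    return (rgba[:, [2, 1, 0]] * 255).astype(np.uint8)    # RGBA -> BGR for OpenCV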

Collision Risk Case

Case 3: Collision Risk

Collision risk evaluation at an intersection. DriveMRP identifies the threat from the ego-vehicle’s trajectory proximity to the black car (ground truth: collision risk), while baselines misclassify it as "no risk".

BibTeX

@inproceedings{hou2025drivemrp,
  title     = {DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction},
  author    = {Hou, Zhiyi and Ma, Enhui and Li, Fang and Lai, Zhiyi and Ho, Kalok and Wu, Zhanqian and Zhou, Lijun and Chen, Long and Sun, Chitian and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun and Yu, Kaicheng},
  year      = {2025}
}