Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

WorldBench Team

Teaser Image
Introduction

The pursuit of fully autonomous driving (AD) has long been a central goal in AI and robotics. Conventional AD systems typically adopt a modular "Perception-Decision-Action" pipeline, where mapping, object detection, motion prediction, and trajectory planning are developed and optimized as separate components.

While this design has achieved strong performance in structured environments, its reliance on hand-crafted interfaces and rules limits adaptability in complex, dynamic, and long-tailed scenarios.


This survey reviews Vision-Language-Action (VLA) models, an emerging paradigm that integrates visual perception, natural language reasoning, and executable actions for autonomous driving. Charting the evolution from precursor Vision-Action (VA) approaches to modern VLA frameworks, we provide historical context and clarify the motivations behind this paradigm shift.



Definition

Vision-Action (VA) Models

A vision-centric driving system that directly maps raw sensory observations to driving actions, thereby avoiding explicit modular decomposition into perception, prediction, and planning. VA models learn end-to-end policies through imitation learning or reinforcement learning.

Tags: End-to-End Models · World Models · Imitation Learning · Reinforcement Learning · Trajectory Prediction
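As a toy illustration of the direct observation-to-action mapping described above, the sketch below trains a single linear policy by behavior cloning (one form of imitation learning). The `ToyVAPolicy` class, the input shape, and the learning rate are illustrative assumptions, not any particular published model.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyVAPolicy:
    """Toy end-to-end Vision-Action policy: raw pixels -> (steer, throttle).

    A single linear map stands in for the perception backbone; real VA
    models use large CNN/Transformer policies, but the training signal
    (regress toward an expert's action) is the same in spirit.
    """

    def __init__(self, img_shape=(32, 32, 3), action_dim=2, lr=1e-3):
        self.in_dim = int(np.prod(img_shape))
        self.W = rng.normal(0.0, 0.01, size=(self.in_dim, action_dim))
        self.b = np.zeros(action_dim)
        self.lr = lr

    def act(self, image):
        """Map an image directly to a continuous action vector."""
        return image.reshape(-1) @ self.W + self.b

    def imitation_step(self, image, expert_action):
        """One behavior-cloning update: gradient step on 0.5 * MSE."""
        x = image.reshape(-1)
        err = (x @ self.W + self.b) - expert_action   # dLoss/dPred
        self.W -= self.lr * np.outer(x, err)
        self.b -= self.lr * err
        return float(0.5 * np.sum(err ** 2))

policy = ToyVAPolicy()
img = rng.random((32, 32, 3))               # stand-in camera frame
expert = np.array([0.1, 0.5])               # expert steer/throttle label
losses = [policy.imitation_step(img, expert) for _ in range(200)]
```

With enough such (observation, expert action) pairs, the policy is optimized end to end with no hand-crafted perception or planning interface, which is exactly what makes VA models hard to introspect.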

Vision-Language-Action (VLA) Models

A multimodal reasoning system that couples visual perception with large vision-language models (VLMs) to produce executable driving actions. VLAs integrate visual understanding, linguistic reasoning, and actionable outputs within a unified framework, enabling more interpretable, generalizable, and human-aligned driving policies through natural language instructions and chain-of-thought reasoning.

Tags: End-to-End VLA · Dual-System VLA · Chain-of-Thought · Instruction Following · Interpretable Reasoning
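One common way such systems make actions emittable by a language model is to discretize continuous controls into a small token vocabulary, so the model can generate actions as ordinary tokens. The sketch below shows that binning scheme in isolation; the bin count, action ranges, and function names are illustrative assumptions rather than the scheme of any specific VLA.

```python
import numpy as np

# Toy action tokenizer: continuous [steer, throttle] controls are mapped
# to integer bins so a language model can emit them as tokens.
N_BINS = 256
ACTION_LOW = np.array([-1.0, 0.0])   # assumed lower bounds [steer, throttle]
ACTION_HIGH = np.array([1.0, 1.0])   # assumed upper bounds

def tokenize(action):
    """Map a continuous action to integer token ids in [0, N_BINS - 1]."""
    norm = (np.asarray(action, dtype=float) - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.clip((norm * (N_BINS - 1)).round().astype(int), 0, N_BINS - 1)

def detokenize(tokens):
    """Invert tokenize, up to quantization error of one bin."""
    norm = np.asarray(tokens, dtype=float) / (N_BINS - 1)
    return ACTION_LOW + norm * (ACTION_HIGH - ACTION_LOW)

action = np.array([0.25, 0.8])
tokens = tokenize(action)
recovered = detokenize(tokens)
assert np.allclose(recovered, action, atol=1.0 / N_BINS)
```

Because actions share the model's token space, chain-of-thought text and action outputs can be produced by the same decoder, which is what lets VLAs interleave linguistic reasoning with control.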


Collections

Vision-Action (VA) Models


Vision-Language-Action (VLA) Models


Datasets & Benchmarks




Contributors

- Tianshuai Hu (Core Contributor)
- Xiaolu Liu (Core Contributor)
- Song Wang (Core Contributor)
- Yiyao Zhu (Core Contributor)
- Ao Liang (Core Contributor)
- Lingdong Kong (Core Contributor, Project Lead)
- Guoyang Zhao (Contributor, VLA)
- Zeying Gong (Contributor, VLA)
- Jun Cen (Contributor, VLA)
- Zhiyu Huang (Contributor, VLA)
- Xiaoshuai Hao (Contributor, VLA)
- Linfeng Li (Contributor, End-to-End Models)
- Hang Song (Contributor, End-to-End Models)
- Xiangtai Li (Contributor, End-to-End Models)
- Jun Ma (Advisor)
- Shaojie Shen (Advisor)
- Jianke Zhu (Advisor)
- Dacheng Tao (Advisor)
- Ziwei Liu (Advisor)
- Junwei Liang (Advisor)