Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

WorldBench Team

Teaser Image
Introduction

The pursuit of fully autonomous driving (AD) has long been a central goal in AI and robotics. Conventional AD systems typically adopt a modular "Perception-Decision-Action" pipeline, where mapping, object detection, motion prediction, and trajectory planning are developed and optimized as separate components.

While this design has achieved strong performance in structured environments, its reliance on hand-crafted interfaces and rules limits adaptability in complex, dynamic, and long-tailed scenarios.


This survey reviews Vision-Language-Action (VLA) models, an emerging paradigm that integrates visual perception, natural language reasoning, and executable actions for autonomous driving. Charting the evolution from precursor Vision-Action (VA) approaches to modern VLA frameworks, we provide historical context and clarify the motivations behind this paradigm shift.



Definition

Vision-Action (VA) Models

A vision-centric driving system that directly maps raw sensory observations to driving actions, thereby avoiding explicit modular decomposition into perception, prediction, and planning. VA models learn end-to-end policies through imitation learning or reinforcement learning.

Tags: End-to-End Models · World Models · Imitation Learning · Reinforcement Learning · Trajectory Prediction
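As a toy illustration of the direct observation-to-action mapping described above, the sketch below trains a single linear policy by behavior cloning (one form of imitation learning). The `ToyVAPolicy` class, the input shape, and the learning rate are illustrative assumptions, not any particular published model.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyVAPolicy:
    """Toy end-to-end Vision-Action policy: raw pixels -> (steer, throttle).

    A single linear map stands in for the perception backbone; real VA
    models use large CNN/Transformer policies, but the training signal
    (regress toward an expert's action) is the same in spirit.
    """

    def __init__(self, img_shape=(32, 32, 3), action_dim=2, lr=1e-3):
        self.in_dim = int(np.prod(img_shape))
        self.W = rng.normal(0.0, 0.01, size=(self.in_dim, action_dim))
        self.b = np.zeros(action_dim)
        self.lr = lr

    def act(self, image):
        """Map an image directly to a continuous action vector."""
        return image.reshape(-1) @ self.W + self.b

    def imitation_step(self, image, expert_action):
        """One behavior-cloning update: gradient step on 0.5 * MSE."""
        x = image.reshape(-1)
        err = (x @ self.W + self.b) - expert_action   # dLoss/dPred
        self.W -= self.lr * np.outer(x, err)
        self.b -= self.lr * err
        return float(0.5 * np.sum(err ** 2))

policy = ToyVAPolicy()
img = rng.random((32, 32, 3))               # stand-in camera frame
expert = np.array([0.1, 0.5])               # expert steer/throttle label
losses = [policy.imitation_step(img, expert) for _ in range(200)]
```

With enough such (observation, expert action) pairs, the policy is optimized end to end with no hand-crafted perception or planning interface, which is exactly what makes VA models hard to introspect.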

Vision-Language-Action (VLA) Models

A multimodal reasoning system that couples visual perception with large vision-language models (VLMs) to produce executable driving actions. VLAs integrate visual understanding, linguistic reasoning, and actionable outputs within a unified framework, enabling more interpretable, generalizable, and human-aligned driving policies through natural language instructions and chain-of-thought reasoning.

Tags: End-to-End VLA · Dual-System VLA · Chain-of-Thought · Instruction Following · Interpretable Reasoning
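One common way such systems make actions emittable by a language model is to discretize continuous controls into a small token vocabulary, so the model can generate actions as ordinary tokens. The sketch below shows that binning scheme in isolation; the bin count, action ranges, and function names are illustrative assumptions rather than the scheme of any specific VLA.

```python
import numpy as np

# Toy action tokenizer: continuous [steer, throttle] controls are mapped
# to integer bins so a language model can emit them as tokens.
N_BINS = 256
ACTION_LOW = np.array([-1.0, 0.0])   # assumed lower bounds [steer, throttle]
ACTION_HIGH = np.array([1.0, 1.0])   # assumed upper bounds

def tokenize(action):
    """Map a continuous action to integer token ids in [0, N_BINS - 1]."""
    norm = (np.asarray(action, dtype=float) - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.clip((norm * (N_BINS - 1)).round().astype(int), 0, N_BINS - 1)

def detokenize(tokens):
    """Invert tokenize, up to quantization error of one bin."""
    norm = np.asarray(tokens, dtype=float) / (N_BINS - 1)
    return ACTION_LOW + norm * (ACTION_HIGH - ACTION_LOW)

action = np.array([0.25, 0.8])
tokens = tokenize(action)
recovered = detokenize(tokens)
assert np.allclose(recovered, action, atol=1.0 / N_BINS)
```

Because actions share the model's token space, chain-of-thought text and action outputs can be produced by the same decoder, which is what lets VLAs interleave linguistic reasoning with control.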


Collections

Vision-Action (VA) Models


Vision-Language-Action (VLA) Models


Datasets & Benchmarks




Contributors

- Tianshuai Hu (Core Contributor)
- Xiaolu Liu (Core Contributor)
- Song Wang (Core Contributor)
- Yiyao Zhu (Core Contributor)
- Ao Liang (Core Contributor)
- Lingdong Kong (Core Contributor, Project Lead)
- Guoyang Zhao (Contributor, VLA)
- Zeying Gong (Contributor, VLA)
- Jun Cen (Contributor, VLA)
- Zhiyu Huang (Contributor, VLA)
- Xiaoshuai Hao (Contributor, VLA)
- Linfeng Li (Contributor, End-to-End Models)
- Hang Song (Contributor, End-to-End Models)
- Xiangtai Li (Contributor, End-to-End Models)
- Jun Ma (Advisor)
- Shaojie Shen (Advisor)
- Jianke Zhu (Advisor)
- Dacheng Tao (Advisor)
- Ziwei Liu (Advisor)
- Junwei Liang (Advisor)