Awesome Spatial Intelligence in VLMs

This curated list brings together key methods, datasets, and benchmarks for spatial intelligence in vision-language models (VLMs).

With the development of multimodal models, evaluating and enhancing their spatial intelligence has become a key research frontier. This list aims to provide researchers and engineers with a quick index to track the latest advancements in the field.

Contributions are welcome! If you find an excellent resource, please submit it via a pull request.

Methods

Visual-based methods

Title	Date	Code
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics	2025-12	Github
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models	2025-12	Github
SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery	2025-12	Github
Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling	2025-12	Github
CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning	2025-12	-
S2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance	2025-12	-
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence	2025-12	-
Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation	2025-12	-
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning	2025-11	Github
G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning	2025-11	Github
Video Spatial Reasoning with Object-Centric 3D Rollout	2025-11	-
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models	2025-11	-
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models	2025-11	-
Cambrian-S: Towards Spatial Supersensing in Video	2025-11	Github
Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images	2025-11	Github
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards	2025-11	Github
TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics	2025-10	Github
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning	2025-10	Github
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation	2025-10	Github
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views	2025-10	Github
Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks	2025-10	Github
SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models	2025-10	Github
SpaceVista: All-Scale Visual Spatial Reasoning from mm to km	2025-10	Github
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models	2025-09	Github
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model	2025-09	-
3D Aware Region Prompted Vision Language Model	2025-09	Github
UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding	2025-08	Github
SIFThinker: Spatially-Aware Image Focus for Visual Reasoning	2025-08	Github
Enhancing Spatial Reasoning through Visual and Textual Thinking	2025-07	-
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning	2025-07	Github
Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models	2025-06	Github
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs	2025-06	Github
SpatialLM: Training Large Language Models for Structured Indoor Modeling	2025-06	Github
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data	2025-06	Github
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing	2025-06	Github
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces	2025-05	Github
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics	2025-05	Github
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors	2025-05	Github
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models	2025-05	Github
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence	2025-05	Github
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs	2025-05	Github
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction	2025-05	Github
LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding	2025-05	-
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning	2025-05	Github
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation	2025-04	Github
Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipe	2025-04	-
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning	2025-04	Github
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning	2025-04	Github
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning	2025-04	Github
ROSS3D: Reconstructive Visual Instruction Tuning with 3D-Awareness	2025-04	Github
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data	2025-04	-
Visual Agentic AI for Spatial Reasoning with a Dynamic API	2025-02	Github
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding	2025-01	Github
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning	2025-01	Github
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Model	2024-12	Github
COARSE CORRESPONDENCES Boost Spatial-Temporal Reasoning in Multimodal Language Model	2024-08	Github
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics	2024-06	GitHub
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models	2024-06	Github
SpatialBot: Precise Spatial Understanding with Vision Language Models	2024-06	Github
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs	2024-04	Github
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors	2024-03	-
Can Transformers Capture Spatial Relations between Objects?	2024-03	Github
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities	2024-01	Github
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis	2024-01	Github
3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V	2023-12	-

Text-based methods

Title	Date	Code
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought	2025-01	-
Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs	2024-11	-
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning	2024-10	-
SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models	2024-06	Github
Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models	2024-02	-
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark	2024-01	Github

Datasets & Benchmarks

Visual-based data

Title	Date	Code
EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding	2026-01	Github
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics	2025-12	Github
Towards Cross-View Point Correspondence in Vision-Language Models	2025-12	Github
ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints	2025-11	-
Scaling Spatial Intelligence with Multimodal Foundation Models	2025-11	Github
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition	2025-11	Github
Visual Spatial Tuning	2025-11	Github
Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models	2025-10	Github
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence	2025-10	Github
Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes	2025-10	-
NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions	2025-10	Github
SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs	2025-09	Github
Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes	2025-09	Github
Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture	2025-09	Github
VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning	2025-08	Github
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations	2025-09	Github
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models	2025-08	Github
11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis	2025-08	-
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting	2025-07	Github
Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models	2025-07	-
SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization	2025-07	Github
Spatial Mental Modeling from Limited Views	2025-06	Github
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks	2025-06	-
IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering	2025-06	Github
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes	2025-06	Github
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly	2025-06	Github
Can Vision Language Models Infer Human Gaze Direction? A Controlled Study	2025-06	Github
SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence	2025-06	Github
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations	2025-06	Github
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models	2025-06	Github
InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models	2025-06	-
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence	2025-05	Github
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics	2025-05	Github
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models	2025-05	Github
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding	2025-05	Github
MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence	2025-05	Github
Can Multimodal Large Language Models Understand Spatial Relations?	2025-05	Github
Visuospatial Cognitive Assistant	2025-05	Github
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?	2025-05	Github
Vision language models have difficulty recognizing virtual objects	2025-05	-
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models	2025-05	Github
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames	2025-05	-
SITE: towards Spatial Intelligence Thorough Evaluation	2025-05	Github
CameraBench: Towards Understanding Camera Motions in Any Video	2025-04	Github
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs	2025-04	Github
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D	2025-03	Github
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLM	2025-03	-
Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space	2025-03	Github
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?	2025-03	Github
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models	2025-03	Github
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models	2025-03	Github
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?	2025-03	Github
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models	2025-02	Github
FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks	2025-02	-
iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs	2025-02	Github
Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics	2025-02	-
SAT: Spatial Aptitude Training for Multimodal Language Models	2024-12	Github
SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models	2024-12	Github
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark	2024-12	Github
Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces	2024-12	Github
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics	2024-11	Github
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models	2024-11	-
IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos	2024-11	Github
Is ‘Right’ Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning	2024-10	Github
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models	2024-10	Github
Does Spatial Cognition Emerge in Frontier Models?	2024-10	-
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities	2024-10	Github
R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images	2024-10	Github
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models	2024-09	Github
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?	2024-09	-
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs	2024-07	Github
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models	2024-06	Github
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners	2024-06	Github
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models	2024-06	Github
GSR-Bench: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs	2024-06	-
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning	2024-05	Github
Visually Descriptive Language Model for Vector Graphics Reasoning	2024-04	-
SQA3D: Situated Question Answering in 3D Scenes	2022-10	Github
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals	2022-03	Github
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings	2020-03	Github

Text-based data

Title	Introduction	Date	Code
Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning		2024-12	-
GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning		2024-07	-

Findings

Title	Date	Code
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks	2025-10	Github
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective	2025-09	Github
Has GPT-5 Achieved Spatial Intelligence? An Empirical Study	2025-08	-
A Call for New Recipes to Enhance Spatial Reasoning in MLLMs	2025-03	Github
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas	2025-03	Github
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models	2025-03	Github

Applications

Title	Date	Code
Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model	2025-10	Github
SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models	2025-05	-
InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning	2025-05	Github
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks	2025-03	-
SOFAR: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation	2025-02	Github
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning	2025-03	Github
VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning	2025-02	-
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models	2025-01	Github
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection	2024-12	Github
EMMA-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning	2024-12	Github
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning	2024-04	Github
Improving Vision-and-Language Reasoning via Spatial Relations Modeling	2023-11	-
