mll-lab-nu/Awesome-Spatial-Intelligence-in-VLM

Awesome Spatial Intelligence in VLMs

This carefully curated list brings together key methods, datasets, and benchmarks in the field of spatial intelligence for VLMs.

With the development of multimodal models, evaluating and enhancing their spatial intelligence has become a key research frontier. This list aims to provide researchers and engineers with a quick index to track the latest advancements in the field.

Contributions are welcome: if you find an excellent resource, please submit it via Pull Request!

Methods

Visual-based methods

Title Introduction Date Code

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
image 2025-12 Github

N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
image 2025-12 Github

SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
image 2025-12 Github

Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
image 2025-12 Github

CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning
image 2025-12 -

S2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
image 2025-12 -

EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
image 2025-12 -

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
image 2025-12 -

Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
image 2025-11 Github

G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
image 2025-11 Github

Video Spatial Reasoning with Object-Centric 3D Rollout
image 2025-11 -

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
image 2025-11 -

SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
image 2025-11 -

Cambrian-S: Towards Spatial Supersensing in Video
image 2025-11 Github

Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
image 2025-11 Github

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
image 2025-11 Github

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
image 2025-10 Github

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
image 2025-10 Github

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
image 2025-10 Github

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
image 2025-10 Github

Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
image 2025-10 Github

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
image 2025-10 Github

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
image 2025-10 Github
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
image 2025-09 Github
See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model
image 2025-09 -

3D Aware Region Prompted Vision Language Model
image 2025-09 Github

UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
image 2025-08 Github

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
image 2025-08 Github

Enhancing Spatial Reasoning through Visual and Textual Thinking
image 2025-07 -
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
image 2025-07 Github
Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models
image 2025-06 Github
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
image 2025-06 Github
SpatialLM: Training Large Language Models for Structured Indoor Modeling
image 2025-06 Github

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data
image 2025-06 Github

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
image 2025-06 Github

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
image 2025-05 Github
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
image 2025-05 Github
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
image 2025-05 Github
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
image 2025-05 Github
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
image 2025-05 Github

STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
image 2025-05 Github

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
image 2025-05 Github

LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding
image 2025-05 -
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
image 2025-05 Github
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
image 2025-04 Github

Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipe
image 2025-04 -

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
image 2025-04 Github

Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning
image 2025-04 Github
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning
image 2025-04 Github
ROSS3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
image 2025-04 Github
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
image 2025-04 -

Visual Agentic AI for Spatial Reasoning with a Dynamic API
image 2025-02 Github
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
image 2025-01 Github

SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
image 2025-01 Github
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Model
image 2024-12 Github

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
image 2024-08 Github
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
image 2024-06 GitHub
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
image 2024-06 Github
SpatialBot: Precise Spatial Understanding with Vision Language Models
image 2024-06 Github
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
image 2024-04 Github

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
image 2024-03 -
Can Transformers Capture Spatial Relations between Objects?
image 2024-03 Github
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
image 2024-01 Github

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
image 2024-01 Github

3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V
image 2023-12 -

Text-based methods

Title Introduction Date Code

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
image 2025-01 -

Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs
image 2024-11 -
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
image 2024-10 -
SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models
image 2024-06 Github
Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models
image 2024-02 -
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark
image 2024-01 Github

Datasets & Benchmarks

Visual-based data

Title Introduction Date Code

EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding
image 2026-01 Github

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
image 2025-12 Github

Towards Cross-View Point Correspondence in Vision-Language Models
image 2025-12 Github
ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
image 2025-11 -

Scaling Spatial Intelligence with Multimodal Foundation Models
image 2025-11 Github

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
image 2025-11 Github

Visual Spatial Tuning
image 2025-11 Github

Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
image 2025-10 Github

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
image 2025-10 Github

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
image 2025-10 -

NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
image 2025-10 Github

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
image 2025-09 Github

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
image 2025-09 Github

Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture
image 2025-09 Github

VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning
image 2025-08 Github

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
image 2025-09 Github
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
image 2025-08 Github

11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
image 2025-08 -
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
image 2025-07 Github

Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models
image 2025-07 -

SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization
image 2025-07 Github

Spatial Mental Modeling from Limited Views
image 2025-06 Github

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
image 2025-06 -
IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
image 2025-06 Github
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
image 2025-06 Github
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
image 2025-06 Github

Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
image 2025-06 Github

SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
image 2025-06 Github

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
image 2025-06 Github

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
image 2025-06 Github

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
image 2025-06 -

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
image 2025-05 Github
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
image 2025-05 Github

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
image 2025-05 Github

SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
image 2025-05 Github

MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
image 2025-05 Github

Can Multimodal Large Language Models Understand Spatial Relations?
image 2025-05 Github

Visuospatial Cognitive Assistant
image 2025-05 Github

Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
image 2025-05 Github

Vision language models have difficulty recognizing virtual objects
image 2025-05 -

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
image 2025-05 Github

Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
image 2025-05 -
SITE: towards Spatial Intelligence Thorough Evaluation
image 2025-05 Github

CameraBench: Towards Understanding Camera Motions in Any Video
image 2025-04 Github

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
image 2025-04 Github

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
image 2025-03 Github

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLM
image 2025-03 -

Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
image 2025-03 Github
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
image 2025-03 Github
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
image 2025-03 Github

Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
image 2025-03 Github

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
image 2025-03 Github
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
image 2025-02 Github

FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks
image 2025-02 -

iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs
image 2025-02 Github

Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
image 2025-02 -
SAT: Spatial Aptitude Training for Multimodal Language Models
image 2024-12 Github
SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models
image 2024-12 Github
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
image 2024-12 Github
Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces
image 2024-12 Github
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
image 2024-11 Github
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
image 2024-11 -
IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos
image 2024-11 Github

Is ‘Right’ Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning
image 2024-10 Github
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
image 2024-10 Github
Does Spatial Cognition Emerge in Frontier Models?
image 2024-10 -
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
image 2024-10 Github

R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images
image 2024-10 Github
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
image 2024-09 Github
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
image 2024-09 -

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs
image 2024-07 Github
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
image 2024-06 Github
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners
image 2024-06 Github
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
image 2024-06 Github
GSR-Bench: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs
image 2024-06 -
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
image 2024-05 Github
Visually Descriptive Language Model for Vector Graphics Reasoning
image 2024-04 -
SQA3D: Situated Question Answering in 3D Scenes
image 2022-10 Github
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
image 2022-03 Github
SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings
image 2020-03 Github

Text-based data

Title Introduction Date Code

Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning
image 2024-12 -

GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning
image 2024-07 -

Findings

Title Introduction Date Code

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
image 2025-10 Github

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
image 2025-09 Github

Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
image 2025-08 -

A Call for New Recipes to Enhance Spatial Reasoning in MLLMs
image 2025-03 Github

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
image 2025-03 Github

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
image 2025-03 Github

Applications

Title Introduction Date Code

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
image 2025-10 Github

SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
image 2025-05 -

InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
image 2025-05 Github

EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
image 2025-03 -

SOFAR: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
image 2025-02 Github

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
image 2025-03 Github

VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
image 2025-02 -

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models
image 2025-01 Github
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
image 2024-12 Github

EMMA-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
image 2024-12 Github
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
image 2024-04 Github
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
image 2023-11 -
