HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Official implementation of our CVPR 2026 paper "HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models".

🔗 Project Page | 📄 Paper Link

TODO

Release checkpoint (before April 10, 2026)
Release training data and dataloader (before May 1, 2026)

Installation

# Clone the repository
git clone https://github.com/microsoft/HiSpatial.git
cd HiSpatial

# Install the package (core + evaluation dependencies)
pip install -e ".[eval]"

# Install MoGe depth estimator (required for inference)
pip install -e ".[depth]"
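
After installing, a quick import check can confirm that both packages are visible to the current interpreter. This is a minimal sketch: `hispatial` is the module name used in the inference example, `moge` as the import name of the MoGe package is an assumption, and `missing_modules` is a hypothetical helper that is not part of this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "hispatial" comes from the inference example; "moge" is an assumed import name.
    missing = missing_modules(["hispatial", "moge"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If a module is reported as not importable, re-running the corresponding `pip install -e` command in the same environment is the first thing to try.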

Inference

HiSpatial takes an RGB image and a 3D point cloud (estimated from that image by MoGe) as input and answers spatial reasoning questions about the scene.

from hispatial.inference import MoGeProcessor, HiSpatialPredictor

# Initialize MoGe depth estimator and HiSpatial predictor
moge = MoGeProcessor(device_name="cuda")
predictor = HiSpatialPredictor(model_load_path="path/to/weights.pt")

# The image can be given as a file path, a PIL Image, or a numpy array
image = "example.jpg"

# Estimate 3D point cloud from the image
xyz_values = moge.apply_transform(image)

# Ask a spatial question
answer = predictor.query(
    image=image,
    prompt="Which object is closer to the camera, the chair or the table?",
    xyz_values=xyz_values,
)
print(answer)
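
When several questions concern the same image, the point cloud only needs to be estimated once. The sketch below wraps the two calls shown above (`apply_transform` and `query`) in a hypothetical helper, `ask_many`, which is not part of the repository. It assumes that the same `xyz_values` can be reused across queries for one image, which the README does not state explicitly.

```python
def ask_many(moge, predictor, image, prompts):
    """Answer several spatial questions about one image.

    The point cloud is estimated once and reused for every prompt
    (assumption: xyz_values does not depend on the question).
    """
    xyz_values = moge.apply_transform(image)
    return {
        prompt: predictor.query(image=image, prompt=prompt, xyz_values=xyz_values)
        for prompt in prompts
    }
```

With the `moge` and `predictor` objects created above, a call would look like `ask_many(moge, predictor, "example.jpg", ["Which object is closer to the camera, the chair or the table?"])`, returning a dictionary that maps each prompt to its answer.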

Evaluation

We evaluate HiSpatial on six spatial understanding benchmarks: CV-Bench, 3DSRBench, EmbSpatial, Q-Spatial, RoboSpatial, and SpatialRGPT. Each evaluation script can be run independently:

# CV-Bench (2D Relation + 3D)
python eval/eval_cv_bench.py \
    --vlm_model_path path/to/weights.pt \
    --save_path results/cvbench

# 3DSRBench
python eval/eval_3dsrbench.py \
    --vlm_model_path path/to/weights.pt \
    --tsv_path path/to/3DSRBenchv1.tsv \
    --save_path results/3dsrbench

# EmbSpatial
python eval/eval_emb_spatial.py \
    --vlm_model_path path/to/weights.pt \
    --save_path results/embspatial \
    --benchmark_path path/to/embspatial_bench.json

# Q-Spatial (QSpatial+ and QSpatial-ScanNet)
python eval/eval_q_spatial.py \
    --vlm_model_path path/to/weights.pt \
    --save_path results/qspatial \
    --scannet_images_dir path/to/scannet/images

# RoboSpatial
python eval/eval_robospatial.py \
    --vlm_model_path path/to/weights.pt \
    --save_path results/robospatial

# SpatialRGPT
python eval/eval_spatialrgpt.py \
    --vlm_model_path path/to/weights.pt \
    --save_path results/spatialrgpt.jsonl

To run all benchmarks at once, first edit the paths in eval/run_all.sh, then run:

bash eval/run_all.sh
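
The per-benchmark commands above can also be assembled programmatically, for example to keep all paths in one place. The following is a minimal sketch and not the contents of `eval/run_all.sh`, which are not shown here. Script names and flags are copied from the commands above; `BENCHMARKS` and `build_command` are hypothetical names, and all paths are the same placeholders used above.

```python
import shlex

WEIGHTS = "path/to/weights.pt"  # placeholder, as in the commands above

# Script paths and flags are copied from the per-benchmark commands above.
BENCHMARKS = {
    "cv_bench": ("eval/eval_cv_bench.py", {"--save_path": "results/cvbench"}),
    "3dsrbench": ("eval/eval_3dsrbench.py", {
        "--tsv_path": "path/to/3DSRBenchv1.tsv",
        "--save_path": "results/3dsrbench",
    }),
    "emb_spatial": ("eval/eval_emb_spatial.py", {
        "--save_path": "results/embspatial",
        "--benchmark_path": "path/to/embspatial_bench.json",
    }),
    "q_spatial": ("eval/eval_q_spatial.py", {
        "--save_path": "results/qspatial",
        "--scannet_images_dir": "path/to/scannet/images",
    }),
    "robospatial": ("eval/eval_robospatial.py", {"--save_path": "results/robospatial"}),
    "spatialrgpt": ("eval/eval_spatialrgpt.py", {"--save_path": "results/spatialrgpt.jsonl"}),
}


def build_command(script, extra_args, weights=WEIGHTS):
    """Return the argv list for one evaluation script."""
    cmd = ["python", script, "--vlm_model_path", weights]
    for flag, value in extra_args.items():
        cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    for name, (script, extra_args) in BENCHMARKS.items():
        # Print only; pass the list to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(build_command(script, extra_args)))
```

Printing the commands first makes it easy to verify the paths before launching any evaluation.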

Citation

@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
