BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

New York University    NVIDIA

Abstract

Vision–Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ("left of", "behind", etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning, designed for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question–answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling evaluation of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.
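
To illustrate the flavor of the pipeline described above, the sketch below shows how a relative left/right and depth question–answer pair could be derived from two 6D object poses expressed in the camera frame. This is a minimal illustration, not the released pipeline: the intrinsics matrix K, the function names, and the object names are hypothetical, and the actual dataset additionally derives grasp poses, referred object poses, and planning trajectories that are not shown here.

# Minimal sketch (assumed, not the authors' released code) of deriving a
# relative spatial/depth QA pair from two 6D object poses in the camera frame.
import numpy as np

# Hypothetical pinhole camera intrinsics (fx, fy, cx, cy).
K = np.array([[572.4,   0.0, 325.3],
              [  0.0, 573.6, 242.0],
              [  0.0,   0.0,   1.0]])

def project_center(pose_cam: np.ndarray) -> tuple[np.ndarray, float]:
    """Project an object's origin into the image; return (pixel, depth)."""
    center = pose_cam[:3, 3]          # translation of the 4x4 camera-frame pose
    uvw = K @ center
    return uvw[:2] / uvw[2], center[2]

def spatial_qa(name_a, pose_a, name_b, pose_b):
    """Build simple relative left/right and depth question-answer pairs."""
    (px_a, z_a) = project_center(pose_a)
    (px_b, z_b) = project_center(pose_b)
    lr = "left of" if px_a[0] < px_b[0] else "right of"
    depth = "closer to the camera than" if z_a < z_b else "farther from the camera than"
    return [
        {"question": f"Is the {name_a} left or right of the {name_b}?",
         "answer": f"The {name_a} is {lr} the {name_b}."},
        {"question": f"Which is closer to the camera, the {name_a} or the {name_b}?",
         "answer": f"The {name_a} is {depth} the {name_b}."},
    ]

# Example with two made-up poses (translations in metres).
pose_mug = np.eye(4);   pose_mug[:3, 3] = [-0.10, 0.02, 0.60]
pose_drill = np.eye(4); pose_drill[:3, 3] = [0.15, 0.00, 0.85]
for qa in spatial_qa("mug", pose_mug, "drill", pose_drill):
    print(qa["question"], "->", qa["answer"])

Because the ground-truth poses are metric, the same construction extends naturally to distance-valued answers and to object-to-object relationships rather than relying on 2D image heuristics.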

Key Findings

Dataset Details

BibTeX

@article{bhat2025bop,
  title={BOP-ASK: Object-Interaction Reasoning for Vision-Language Models},
  author={Bhat, Vineet and Kim, Sungsu and Blukis, Valts and Heinrich, Greg and Krishnamurthy, Prashanth and Karri, Ramesh and Birchfield, Stan and Khorrami, Farshad and Tremblay, Jonathan},
  journal={arXiv preprint arXiv:2511.16857},
  year={2025}
}