SpatialPi: 3D-Enhanced Robot Learning with Point Cloud Representations

¹Carnegie Mellon University

*Equal Contribution

Abstract

Vision-language-action (VLA) models have shown remarkable success in robot manipulation by learning from large-scale demonstrations. However, most existing approaches rely on 2D image representations, which carry limited explicit spatial information. In this work, we introduce SpatialPi, a toolkit that bridges the gap between raw robot demonstrations and 3D-enhanced learning. SpatialPi transforms robot demonstrations into 3D point cloud datasets and enables fine-tuning of π-models (π₀) on these spatially enriched representations.

Our pipeline combines VGGT-based depth estimation with scalable data processing to convert RLDS-format episodes into LeRobot datasets enriched with per-frame point clouds. The conversion is a fault-tolerant, resumable two-stage process with multi-GPU support and automatic schema repair, and it integrates with Physical Intelligence's OpenPI framework. On the LIBERO benchmark, training with point cloud-enhanced representations improves performance over standard 2D image-based training.

Pipeline Overview

Stage 0: Metadata

Enumerate all RLDS episodes and create LeRobot dataset structure
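
To illustrate how an enumerated episode list can support a resumable, multi-worker conversion, here is a minimal sketch. It is not SpatialPi's actual implementation: the function names (`shard`, `marker`, `run_worker`), the `.done` marker files, and the round-robin sharding are all assumptions made for illustration, and `process` stands in for whatever per-episode work a later stage performs.

```python
from pathlib import Path


def shard(num_episodes: int, worker: int, num_workers: int) -> list[int]:
    """Round-robin split of episode indices across workers (e.g. one per GPU)."""
    return [i for i in range(num_episodes) if i % num_workers == worker]


def marker(out_dir: Path, episode: int) -> Path:
    """Hypothetical per-episode marker file recording that an episode is finished."""
    return out_dir / f"episode_{episode:06d}.done"


def run_worker(out_dir: Path, num_episodes: int, worker: int,
               num_workers: int, process) -> list[int]:
    """Process this worker's pending episodes; return the ones handled in this run."""
    handled = []
    for episode in shard(num_episodes, worker, num_workers):
        if marker(out_dir, episode).exists():
            continue  # finished in an earlier run, so skip on resume
        process(episode)  # may raise; the marker is then never written
        marker(out_dir, episode).touch()  # mark only after success
        handled.append(episode)
    return handled
```

Because each worker writes markers only for its own shard, workers do not contend for a shared progress file, and rerunning the same command after a crash picks up at the first unfinished episode.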

Stage 2: Point Cloud Generation

Generate 3D point clouds using VGGT depth estimation
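
As a rough illustration of how a depth map becomes a point cloud, the sketch below unprojects depth through a pinhole camera model with NumPy. It does not call VGGT and is not necessarily how SpatialPi builds its point clouds: the function name `depth_to_point_cloud` and the assumption that a per-pixel depth map and intrinsics (`fx`, `fy`, `cx`, `cy`) are already available are ours.

```python
import numpy as np


def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Unproject an (H, W) depth map into an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    # u holds pixel column indices, v holds pixel row indices, both of shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # pinhole model: u = fx * x / z + cx
    y = (v - cy) * z / fy  # pinhole model: v = fy * y / z + cy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with non-positive (invalid) depth
```

Points are returned in the camera frame; expressing them in a world or robot frame would additionally require the camera extrinsics.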

Stage 3: Model Training

Fine-tune π₀ models on point cloud-enhanced datasets

📈 Evaluation Results

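We fine-tuned the π₀ base model on the LIBERO dataset and evaluated on the LIBERO benchmark. In the matched 5K comparison, point cloud-enhanced training (π₀_pc) outperforms standard 2D image-based training (π₀) on all four task suites. At 30K, π₀_pc exceeds π₀_libero on three of the four suites and on average (94.8% vs. 92.3%); Libero Object is the exception (97.4% vs. 98.0%).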

Model	Libero Spatial	Libero Object	Libero Goal	Libero 10	Average
π₀_pc @ 5K	81.8%	93.4%	73.8%	55.2%	76.1%
π₀ @ 5K	76.4%	92.2%	70.8%	54.2%	73.4%
π₀_pc @ 30K	98.2%	97.4%	94.0%	89.6%	94.8%
π₀_libero	96.8%	98.0%	93.4%	81.0%	92.3%
π₀_base	0.0%	0.0%	0.0%	0.0%	0.0%
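
The per-suite gaps can be read off the table with a few lines of Python. This is only a convenience sketch over the numbers reported above; the helper name `deltas` is ours, and pairing π₀_pc @ 30K with π₀_libero as its 2D counterpart is an assumption, since the table does not state how π₀_libero was trained.

```python
SUITES = ["Libero Spatial", "Libero Object", "Libero Goal", "Libero 10"]

# Per-suite percentages copied from the results table.
RESULTS = {
    "π₀_pc @ 5K": [81.8, 93.4, 73.8, 55.2],
    "π₀ @ 5K": [76.4, 92.2, 70.8, 54.2],
    "π₀_pc @ 30K": [98.2, 97.4, 94.0, 89.6],
    "π₀_libero": [96.8, 98.0, 93.4, 81.0],
}


def deltas(model: str, baseline: str) -> dict[str, float]:
    """Percentage-point difference (model minus baseline) for each suite."""
    return {
        suite: round(a - b, 1)
        for suite, a, b in zip(SUITES, RESULTS[model], RESULTS[baseline])
    }


if __name__ == "__main__":
    for model, baseline in [("π₀_pc @ 5K", "π₀ @ 5K"), ("π₀_pc @ 30K", "π₀_libero")]:
        print(model, "vs", baseline, deltas(model, baseline))
```

Under that pairing, the largest gap is on Libero 10 at 30K (8.6 points), while Libero Object at 30K is the only negative entry (-0.6 points).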

BibTeX

@misc{spatialpi,
  title={SpatialPi: 3D-Enhanced Robot Learning with Point Cloud Representations},
  author={Zhang, Yidi and Jangir, Yash},
  year={2025}
}

Acknowledgments

We gratefully acknowledge the following projects and teams:

Physical Intelligence
OpenPI framework and π-models

Meta AI
VGGT depth estimation

LIBERO Team
Comprehensive manipulation benchmark

HuggingFace
LeRobot and dataset infrastructure