Yidi Zhang¹*, Yash Jangir¹*
¹Carnegie Mellon University
*Equal Contribution
Vision-language-action (VLA) models have shown remarkable success in robot manipulation tasks by learning from large-scale demonstrations. However, most existing approaches rely on 2D image representations, which inherently lack rich spatial understanding. In this work, we introduce SpatialPi, a comprehensive toolkit that bridges the gap between raw robot demonstrations and powerful 3D-enhanced learning. SpatialPi transforms robot demonstrations into rich 3D point cloud datasets and enables fine-tuning of π-models (π₀) with enhanced spatial representations.
Our pipeline combines VGGT-based depth estimation with scalable data processing to convert RLDS-format episodes into LeRobot datasets enriched with per-frame point clouds. The system features a fault-tolerant, resumable two-stage conversion process with multi-GPU support, automatic schema repair, and integration with Physical Intelligence's OpenPI framework. We evaluate on the LIBERO benchmark, where training with point cloud-enhanced representations improves the average success rate over standard 2D image-based training.
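To make the idea of a fault-tolerant, resumable conversion concrete, here is a minimal sketch of one way such a stage runner could work. It is not SpatialPi's actual implementation: the manifest file, the `run_stage` function, and the stage names are all hypothetical, chosen only to illustrate how completed episodes can be skipped on restart and how a failing episode can be isolated without stopping the run.

```python
import json
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Return the saved progress map, or an empty one on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_manifest(path: Path, manifest: dict) -> None:
    """Write progress via a temp file and rename, so a crash cannot leave a half-written manifest."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(manifest, indent=2))
    tmp.replace(path)


def run_stage(stage, episode_ids, process, manifest_path):
    """Run `process(episode_id)` for every episode that has not completed `stage`.

    Hypothetical helper: returns (done, failed) lists for this invocation.
    """
    manifest = load_manifest(manifest_path)
    done, failed = [], []
    for ep in episode_ids:
        key = str(ep)
        if stage in manifest.get(key, []):
            continue  # finished in an earlier run: skip (resumability)
        try:
            process(ep)
        except Exception:
            failed.append(ep)  # record the failure and keep going (fault tolerance)
            continue
        manifest.setdefault(key, []).append(stage)
        save_manifest(manifest_path, manifest)  # checkpoint after every episode
        done.append(ep)
    return done, failed
```

Calling `run_stage("enumerate", ids, fn, path)` and then `run_stage("pointcloud", ids, fn2, path)` would give a two-stage flow in which rerunning the same command only retries episodes that are missing from the manifest.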
1. Enumerate all RLDS episodes and create the LeRobot dataset structure
2. Generate 3D point clouds using VGGT depth estimation
3. Fine-tune π₀ models on point cloud-enhanced datasets
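The point cloud generation step can be illustrated with a standard pinhole back-projection. The sketch below assumes a per-pixel depth map and camera intrinsics (`fx`, `fy`, `cx`, `cy`) are already available from the depth estimator. It does not call VGGT itself, and the function name is a placeholder rather than part of SpatialPi.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) camera-frame point cloud.

    Assumes a pinhole camera model; pixels with non-finite or non-positive
    depth are dropped.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction); both are (H, W)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    valid = np.isfinite(points).all(axis=1) & (points[:, 2] > 0)
    return points[valid]
```

Running this once per frame yields the kind of per-frame point cloud that could be stored alongside the image observations in the dataset.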
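We fine-tuned the π₀ base model on the LIBERO dataset and evaluated it on the LIBERO benchmark. At 5K, point cloud-enhanced training (π₀_pc) outperforms standard 2D image-based training (π₀) on all four task suites. At 30K, π₀_pc exceeds the π₀_libero baseline on average (94.8% vs. 92.3%) and on three of the four suites, with the largest gain on Libero 10 (89.6% vs. 81.0%); π₀_libero remains slightly ahead on Libero Object (98.0% vs. 97.4%). The π₀_base model without fine-tuning scores 0.0% on every suite.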
| Model | Libero Spatial | Libero Object | Libero Goal | Libero 10 | Average |
|---|---|---|---|---|---|
| π₀_pc @ 5K | 81.8% | 93.4% | 73.8% | 55.2% | 76.5% |
| π₀ @ 5K | 76.4% | 92.2% | 70.8% | 54.2% | 73.4% |
| π₀_pc @ 30K | 98.2% | 97.4% | 94.0% | 89.6% | 94.8% |
| π₀_libero | 96.8% | 98.0% | 93.4% | 81.0% | 92.3% |
| π₀_base | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
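For readers who want to compare the rows suite by suite, the short script below computes the per-suite difference in percentage points between each point cloud model and its 2D counterpart. The numbers are copied from the table; the pairing of π₀_pc @ 30K with π₀_libero and the helper name `suite_deltas` are choices made for this illustration.

```python
SUITES = ["Libero Spatial", "Libero Object", "Libero Goal", "Libero 10"]

# Success rates (%) copied from the results table
RESULTS = {
    "π₀_pc @ 5K": [81.8, 93.4, 73.8, 55.2],
    "π₀ @ 5K": [76.4, 92.2, 70.8, 54.2],
    "π₀_pc @ 30K": [98.2, 97.4, 94.0, 89.6],
    "π₀_libero": [96.8, 98.0, 93.4, 81.0],
}


def suite_deltas(model, baseline):
    """Per-suite difference in percentage points (model minus baseline)."""
    return {
        suite: round(m - b, 1)
        for suite, m, b in zip(SUITES, RESULTS[model], RESULTS[baseline])
    }


for model, baseline in [("π₀_pc @ 5K", "π₀ @ 5K"), ("π₀_pc @ 30K", "π₀_libero")]:
    print(f"{model} vs {baseline}: {suite_deltas(model, baseline)}")
```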
@misc{spatialpi,
title={SpatialPi: 3D-Enhanced Robot Learning with Point Cloud Representations},
author={Zhang, Yidi and Jangir, Yash},
year={2025}
}
We gratefully acknowledge the following projects and teams: