NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
Jiahui Fu*1, Junyu Nan*1,2, Lingfeng Sun*1, Hongyu Li*1,3, Jianing Qian4, Jennifer Barry1, Kris Kitani2, George Konidaris3
1Robotics and AI Institute   2Carnegie Mellon University   3Brown University   4University of Pennsylvania

*Equal contribution
Research Highlights

Solving long-horizon manipulation tasks requires integrating high-level semantic reasoning with physically grounded low-level control. We introduce NovaPlan, a hierarchical framework for zero-shot long-horizon manipulation that unifies closed-loop video language planning with geometrically grounded robot execution, with no prior demonstrations or task-specific training required.

  • Closed-Loop Video Language Planner: A VLM planner decomposes tasks into multi-step sub-goals, and a video generation model synthesizes task-solving videos as visual plans for robot action extraction. The system monitors execution in a closed loop, enabling recovery from single-step failures through autonomous re-planning.
  • Dual-Flow Action Extraction: To compute low-level robot actions, the system extracts object flow (task-relevant 3D object keypoint tracking) and hand flow (human hand pose estimation) from generated videos, and employs a switching mechanism to select the better flow reference, maintaining stable execution even under heavy occlusion or depth inaccuracy.
  • Strong Zero-Shot Performance: NovaPlan achieves strong zero-shot performance on diverse long-horizon tasks and the Functional Manipulation Benchmark (FMB), solving complex multi-step assembly tasks and performing non-prehensile error recovery via hand poking, all without any prior training.
Interactive Viewer: Object and Hand Flows

3D Point Cloud & Actionable Object Flow from Generated Video

Note: The visualized end-effector is offset from the physical one due to longer gripper fingers used in real-world experiments.

Initial Observation

Generated Video

Execution Video (1x speed)

Methods

The full NovaPlan pipeline consists of a high-level planning system and a low-level execution system.

High-level Planning System: The high-level planner takes the task instruction and the current observation, reasons with a VLM about task progress, and proposes multiple candidate task-solving videos. Candidates are scored by flow and semantic consistency, and the best one is selected. The low-level planner computes robot actions from the object or hand flow extracted from the chosen video. Updated observations are sent to a VLM critic, which decides whether the robot should proceed to the next step or recover from failure.
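The closed-loop interaction between planner, video model, and critic can be sketched as below. All class and method names (`planner.decompose`, `video_model.generate`, `critic.succeeded`, etc.) are hypothetical illustrations of the described control flow, not the authors' actual interfaces:

```python
def run_closed_loop(instruction, robot, planner, video_model, critic,
                    num_candidates=4, max_attempts=3):
    """Sketch of the high-level loop: decompose the task into sub-goals,
    generate candidate videos per sub-goal, execute actions extracted from
    the best candidate, and let a VLM critic decide whether to advance or
    re-plan from the updated observation."""
    obs = robot.observe()
    for sub_goal in planner.decompose(instruction, obs):
        for attempt in range(max_attempts):
            # Propose several task-solving videos for this sub-goal.
            candidates = [video_model.generate(obs, sub_goal)
                          for _ in range(num_candidates)]
            # Keep the candidate with the best flow/semantic consistency.
            video = max(candidates, key=planner.score)
            # Extract object/hand flow and convert it to robot actions.
            actions = planner.extract_actions(video, obs)
            robot.execute(actions)
            obs = robot.observe()
            if critic.succeeded(sub_goal, obs):
                break  # proceed to the next sub-goal
            # otherwise re-plan from the current observation
    return obs
```

The outer loop advances through sub-goals while the inner loop provides the single-step recovery described above: a failed attempt triggers fresh video generation from the latest observation rather than replaying a stale plan.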

Low-level Execution System: Given the generated video plan (RGB + depth), the system either grounds the target object, tracks its 3D keypoints, and recovers the object flow, or estimates the hand pose with HaMeR, calibrates its 3D scale, and computes the hand flow. The resulting object/hand flow is converted into robot actions for on-robot execution, while the video language planner enables closed-loop re-planning and recovery.
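A minimal sketch of the dual-flow idea, assuming per-keypoint tracking confidences are available: the better flow reference is selected by confidence, and keypoint motion between consecutive frames is converted to a rigid end-effector motion via a standard least-squares (Kabsch) fit. The function names and confidence heuristic here are illustrative, not the paper's exact method:

```python
import numpy as np

def select_flow(object_flow, object_conf, hand_flow, hand_conf):
    """Prefer the object flow, but fall back to the hand flow when object
    tracking is less reliable (e.g. under heavy occlusion or bad depth)."""
    if float(object_conf.mean()) >= float(hand_conf.mean()):
        return object_flow
    return hand_flow

def rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src -> dst, both
    (K, 3) keypoint sets. Returns rotation R (3x3) and translation t (3,)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    # Fix a possible reflection so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

def flow_to_actions(flow):
    """Convert a (T, K, 3) keypoint flow into per-step end-effector motions
    as (R, t) pairs between consecutive frames."""
    return [rigid_transform(flow[i], flow[i + 1]) for i in range(len(flow) - 1)]
```

For a purely translating object the fitted rotations come out as identity and the translations recover the per-frame displacement, which is the behavior a flow-to-action converter should exhibit on clean input.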

Experiments
Quantitative Results
Failure Modes