NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
Jiahui Fu*1, Junyu Nan*1,2, Lingfeng Sun*1, Hongyu Li*1,3, Jianing Qian4, Jennifer Barry1, Kris Kitani2, George Konidaris3
1Robotics and AI Institute   2Carnegie Mellon University   3Brown University   4University of Pennsylvania

*Equal contribution
Research Highlights

Solving long-horizon manipulation tasks requires integrating high-level semantic reasoning with physically grounded low-level control. We introduce NovaPlan, a hierarchical framework for zero-shot long-horizon manipulation that unifies closed-loop video language planning with geometrically grounded robot execution, with no prior demonstrations or task-specific training required.

  • Closed-Loop Video Language Planner: A VLM planner decomposes tasks into multi-step sub-goals, and a video generation model synthesizes task-solving videos as visual plans for robot action extraction. The system monitors execution in a closed loop, enabling recovery from single-step failures through autonomous re-planning.
  • Dual-Flow Action Extraction: To compute low-level robot actions, the system extracts object flow (task-relevant 3D object keypoint tracking) and hand flow (human hand pose estimation) from generated videos, and employs a switching mechanism to select the better flow reference, maintaining stable execution even under heavy occlusion or depth inaccuracy.
  • Strong Zero-Shot Performance: NovaPlan achieves strong zero-shot performance on diverse long-horizon tasks and the Functional Manipulation Benchmark (FMB), solving complex multi-step assembly tasks and performing non-prehensile error recovery via hand poking, all without any prior training.
Interactive Viewer: Object and Hand Flows

3D Point Cloud & Actionable Object Flow from Generated Video

Note: The visualized end-effector is offset from the physical one due to longer gripper fingers used in real-world experiments.

Initial Observation

Generated Video

Execution Video (1x speed)

Methods

The full NovaPlan pipeline consists of a high-level planning system and a low-level execution system.

High-level Planning System: The high-level planner takes the task instruction and the current observation, reasons with a VLM about task progress, and proposes multiple candidate task-solving videos. Candidates are scored by flow and semantic consistency, and the best one is selected. The low-level planner computes robot actions from the object or hand flow extracted from the chosen video. Updated observations are sent to a VLM critic, which decides whether the robot should proceed to the next step or recover from failure.
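The closed-loop interaction between planner, video model, and critic can be sketched as below. All class and method names (`planner.decompose`, `video_model.generate`, `critic.succeeded`, etc.) are hypothetical illustrations of the described control flow, not the authors' actual interfaces:

```python
def run_closed_loop(instruction, robot, planner, video_model, critic,
                    num_candidates=4, max_attempts=3):
    """Sketch of the high-level loop: decompose the task into sub-goals,
    generate candidate videos per sub-goal, execute actions extracted from
    the best candidate, and let a VLM critic decide whether to advance or
    re-plan from the updated observation."""
    obs = robot.observe()
    for sub_goal in planner.decompose(instruction, obs):
        for attempt in range(max_attempts):
            # Propose several task-solving videos for this sub-goal.
            candidates = [video_model.generate(obs, sub_goal)
                          for _ in range(num_candidates)]
            # Keep the candidate with the best flow/semantic consistency.
            video = max(candidates, key=planner.score)
            # Extract object/hand flow and convert it to robot actions.
            actions = planner.extract_actions(video, obs)
            robot.execute(actions)
            obs = robot.observe()
            if critic.succeeded(sub_goal, obs):
                break  # proceed to the next sub-goal
            # otherwise re-plan from the current observation
    return obs
```

The outer loop advances through sub-goals while the inner loop provides the single-step recovery described above: a failed attempt triggers fresh video generation from the latest observation rather than replaying a stale plan.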

Low-level Execution System: Given the generated video plan (RGB + depth), the system either grounds the target object, tracks its 3D keypoints, and recovers the object flow, or estimates the hand pose with HaMeR, calibrates its 3D scale, and computes the hand flow. The resulting object/hand flow is converted into robot actions for on-robot execution, while the video language planner enables closed-loop re-planning and recovery.
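A minimal sketch of the dual-flow idea, assuming per-keypoint tracking confidences are available: the better flow reference is selected by confidence, and keypoint motion between consecutive frames is converted to a rigid end-effector motion via a standard least-squares (Kabsch) fit. The function names and confidence heuristic here are illustrative, not the paper's exact method:

```python
import numpy as np

def select_flow(object_flow, object_conf, hand_flow, hand_conf):
    """Prefer the object flow, but fall back to the hand flow when object
    tracking is less reliable (e.g. under heavy occlusion or bad depth)."""
    if float(object_conf.mean()) >= float(hand_conf.mean()):
        return object_flow
    return hand_flow

def rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src -> dst, both
    (K, 3) keypoint sets. Returns rotation R (3x3) and translation t (3,)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    # Fix a possible reflection so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

def flow_to_actions(flow):
    """Convert a (T, K, 3) keypoint flow into per-step end-effector motions
    as (R, t) pairs between consecutive frames."""
    return [rigid_transform(flow[i], flow[i + 1]) for i in range(len(flow) - 1)]
```

For a purely translating object the fitted rotations come out as identity and the translations recover the per-frame displacement, which is the behavior a flow-to-action converter should exhibit on clean input.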

Experiments
Quantitative Results
Failure Modes