Solving long-horizon manipulation tasks requires integrating high-level semantic reasoning with physically grounded low-level control. We introduce NovaPlan, a hierarchical framework for zero-shot long-horizon manipulation that unifies closed-loop video language planning with geometrically grounded robot execution, requiring no prior demonstrations or task-specific training.
- Closed-Loop Video Language Planner: A VLM planner decomposes tasks into multi-step sub-goals, and a video generation model synthesizes task-solving videos as visual plans for robot action extraction. The system monitors execution in a closed loop, enabling recovery from single-step failures through autonomous re-planning.
- Dual-Flow Action Extraction: To compute low-level robot actions, the system extracts object flow (task-relevant 3D object keypoint tracking) and hand flow (human hand pose estimation) from generated videos, and employs a switching mechanism to select the better flow reference, maintaining stable execution even under heavy occlusion or depth inaccuracy.
- Strong Zero-Shot Performance: NovaPlan achieves strong zero-shot performance on diverse long-horizon tasks and the Functional Manipulation Benchmark (FMB), solving complex multi-step assembly tasks and performing non-prehensile error recovery via hand poking, all without any prior training.
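The dual-flow switching mechanism can be sketched as a confidence-gated fallback: prefer the object flow when the object track is trustworthy, and fall back to the hand flow under heavy occlusion or unreliable depth. The function name, thresholds, and confidence scores below are illustrative assumptions, not the paper's actual criteria:

```python
import numpy as np

def select_flow(object_flow, hand_flow, visible_ratio, depth_valid_ratio,
                min_visible=0.5, min_depth_valid=0.6):
    """Pick the more reliable flow reference for the current step.

    object_flow: (T, K, 3) tracked 3D object keypoints, or None if tracking failed
    hand_flow:   (T, 3) estimated hand trajectory, or None if estimation failed
    visible_ratio / depth_valid_ratio: hypothetical per-step confidence scores
    for the object track (fraction of visible keypoints, fraction of valid depth).
    """
    object_ok = (
        object_flow is not None
        and visible_ratio >= min_visible          # not heavily occluded
        and depth_valid_ratio >= min_depth_valid  # depth readings trustworthy
    )
    if object_ok:
        return "object", object_flow
    if hand_flow is not None:
        return "hand", hand_flow  # fall back to the hand-flow reference
    raise RuntimeError("no usable flow reference; trigger re-planning")
```

A real system would derive the confidence scores from the tracker itself (e.g., per-keypoint visibility and depth validity masks) rather than passing them in separately.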
3D Point Cloud & Actionable Object Flow from Generated Video
Note: The visualized end-effector is offset from the physical one due to longer gripper fingers used in real-world experiments.
Panels: Initial Observation · Generated Video · Execution Video (1x speed)
The full NovaPlan pipeline consists of a high-level planning system and a low-level execution system.
High-level Planning System: The high-level planner takes the task instruction and current observation and, after VLM reasoning about task progress, proposes multiple candidate task-solving videos. Candidates are selected based on flow and semantic consistency. The low-level planner computes the robot action from the extracted object or hand flow, and updated observations are sent to a VLM critic that decides whether the robot should proceed to the next step or recover from failure.
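The plan–execute–verify loop can be sketched as follows. All callables here (`plan_subgoals`, `generate_videos`, `score_video`, `execute_step`, `critic`) are hypothetical stand-ins for the VLM planner, the video generator, the flow/semantic consistency scorer, the flow-based controller, and the VLM critic; the retry logic is an assumed structure, not the paper's exact control flow:

```python
def run_closed_loop(task, observe, plan_subgoals, generate_videos,
                    score_video, execute_step, critic, max_retries=3):
    """Closed-loop high-level control: plan, execute, verify, recover."""
    subgoals = plan_subgoals(task, observe())
    for goal in subgoals:
        for attempt in range(max_retries):
            obs = observe()
            # Propose several candidate task-solving videos for this sub-goal.
            candidates = generate_videos(goal, obs)
            # Select the best candidate by combined flow + semantic consistency.
            best = max(candidates, key=lambda v: score_video(v, goal, obs))
            execute_step(best)
            # The critic inspects the updated observation: proceed or recover.
            if critic(goal, observe()) == "proceed":
                break
        else:
            raise RuntimeError(f"sub-goal failed after {max_retries} attempts: {goal}")
```

Because the critic re-checks the scene after every step, a single-step failure triggers only local re-planning of that sub-goal rather than restarting the whole task.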
Low-level Execution System: Given the generated video plan (RGB+Depth), the system switches between two pathways: grounding the target object, tracking its 3D keypoints, and recovering the object flow; or estimating the hand pose with HaMeR, calibrating its 3D scale, and computing the hand flow. The resulting object/hand flows are converted into robot actions for on-robot execution, while the video language planner enables closed-loop re-planning and recovery.
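One standard way to convert tracked 3D keypoints into a robot action is to estimate the rigid transform between consecutive frames via Kabsch/SVD alignment and apply that delta to the end-effector pose. This is a minimal sketch of that idea under the assumption of rigid-body motion, not necessarily NovaPlan's exact formulation:

```python
import numpy as np

def flow_to_delta_pose(kp_prev, kp_next):
    """Estimate the rigid transform (R, t) that moves tracked 3D keypoints
    from one video frame to the next (Kabsch/SVD alignment). Applying this
    delta to the grasped object's current pose yields the next robot action.

    kp_prev, kp_next: (K, 3) arrays of corresponding 3D keypoints.
    """
    c_prev, c_next = kp_prev.mean(0), kp_next.mean(0)
    # 3x3 cross-covariance of the centered point sets.
    H = (kp_prev - c_prev).T @ (kp_next - c_next)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_next - R @ c_prev
    return R, t
```

Averaging over all tracked keypoints makes the estimate robust to moderate per-point tracking noise; grossly occluded or depth-invalid points would need to be masked out first.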



Multi-Step Experiment Rollouts: We demonstrate NovaPlan on diverse long-horizon tasks. Use the tabs to view different experiments and the toggle to inspect object vs. hand flow.
Quantitative Results: NovaPlan achieves the highest long-horizon success rate on all three multi-step tasks, i.e., Four-layer Block Stacking, Color Sorting, and Hidden Object Search, outperforming both the leading VLA method (π0.5) and VLM-based planning method (MOKA). By leveraging both object and hand flows for stability and non-prehensile recovery, NovaPlan demonstrates stable short-horizon task execution and robust recovery capabilities, maintaining strong per-step performance even as task complexity increases. For fair comparison, all compared methods (marked with ⁺) are provided with an oracle high-level plan.
Failure Modes in Different Modules: We identify five failure modes across the NovaPlan pipeline:
- Video Generation Failure: The generated video does not follow the given action commands.
- Object Tracking Failure: TAPIP3D fails to robustly track the object, leading to incorrect object flow extraction.
- Hand Tracking Failure: HaMeR fails to robustly track the hand, leading to incorrect hand flow extraction.
- Execution Failure: Inaccurate flow estimation leads to failed execution.
- Inability to Reorient: Our existing regrasp and non-prehensile poking mechanisms cannot recover the end state of the green block, as it requires a reorientation our system cannot handle.













