Authors
Quan Chen1,2★, Chenrui Shi1★, Qi Chen1,2, Yuwei Wu1,
Zhi Gao1, Xintong Zhang1, Rui Gao1,2, Kun Wu3, Yunde Jia2
★ Equal contribution
Affiliations
1 Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology
2 Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
3 Beijing Innovation Center of Humanoid Robotics
📄 Paper: Long-Horizon Visual Imitation Learning via Plan and Code Reflection
📊 Dataset: LongVILBench
🚀 Project: LongVIL-Agent
Our experiments were conducted on:
- Ubuntu 22.04.4 LTS
- Python 3.10.14
- PyTorch 2.4.0+cu121
- NVIDIA 4090D GPU
Create the environment from the provided environment.yml:

```shell
conda env create -f environment.yml -n longvil
conda activate longvil
```

We adopt and extend the keyframe extraction module from SeeDo.
This step must be done before running the benchmark.
- Original SeeDo extracts keyframes for either left or right hand.
- Our improved version allows extracting keyframes for both hands independently.
We provide a modified script (get_frames.py):
- Install GroundingDINO, SAM, and SAM2 as in SeeDo.
- Replace the original SeeDo script (get_frame_by_hands.py) with ours (get_frames.py).
Download LongVILBench.
Place the dataset under:

```
LongVIL/data
├── level1/
├── level2/
└── level3/
```
The three level folders (level1, level2, and level3) should replace the example folder inside LongVIL/data/.
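As a sanity check before running the benchmark, a small helper like the sketch below can confirm the level folders are in place. The function name and use of pathlib are our own, not part of the repo:

```python
from pathlib import Path

def check_layout(root: str, levels=("level1", "level2", "level3")):
    """Return the expected level folders that are missing under root."""
    base = Path(root)
    return [lv for lv in levels if not (base / lv).is_dir()]

# e.g. check_layout("LongVIL/data") should return [] once the dataset is placed
```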
Edit run.sh to add your API key and specify the benchmark level, then run:

```shell
bash run.sh
```
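For reference, the edits might look like the fragment below; the variable names are illustrative assumptions, not the script's actual contents, so check run.sh for the real placeholders:

```shell
# Illustrative only -- the actual variable names in run.sh may differ
export OPENAI_API_KEY="your-key-here"   # assumed name for the API key variable
export BENCH_LEVEL="level2"             # one of level1 / level2 / level3
```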
Edit the following lines in evaluate.py according to the target level:

```python
OUTPUT_ROOT = Path("./output/data/level2")
GT_ROOT = Path("./data/level2")
```

Then run:

```shell
python evaluate.py
```

We provide three metrics:
- Exact Match Accuracy (EMA) – sequence match
- Step-wise Matching Score (SMS) – prefix match
- Final State Accuracy (FSA) – final state correctness
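The first two metrics can be sketched as below. This is our own reading of "sequence match" and "prefix match", not the repo's evaluate.py implementation; FSA is omitted because it requires checking the resulting scene state. See the paper for the exact definitions:

```python
def exact_match(pred, gt):
    """EMA sketch: 1.0 iff the predicted action sequence equals ground truth exactly."""
    return float(pred == gt)

def stepwise_score(pred, gt):
    """SMS sketch: fraction of ground-truth steps matched before the first divergence."""
    if not gt:
        return 1.0
    matched = 0
    for p, g in zip(pred, gt):
        if p != g:
            break
        matched += 1
    return matched / len(gt)
```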
If you use this work, please cite:
```
@misc{chen2025longhorizonvisualimitationlearning,
  title         = {Long-Horizon Visual Imitation Learning via Plan and Code Reflection},
  author        = {Quan Chen and Chenrui Shi and Qi Chen and Yuwei Wu and Zhi Gao and Xintong Zhang and Rui Gao and Kun Wu and Yunde Jia},
  year          = {2025},
  eprint        = {2509.05368},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2509.05368}
}
```

This project is released under the Apache-2.0 License.