chenquan2002/LongVIL


LongVIL: Long-Horizon Visual Imitation Learning via Plan and Code Reflection

Authors
Quan Chen1,2★, Chenrui Shi1★, Qi Chen1,2, Yuwei Wu1,
Zhi Gao1, Xintong Zhang1, Rui Gao1,2, Kun Wu3, Yunde Jia2

★ Equal contribution

Affiliations
1 Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology

2 Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University

3 Beijing Innovation Center of Humanoid Robotics

📄 Paper: Long-Horizon Visual Imitation Learning via Plan and Code Reflection
📊 Dataset: LongVILBench
🚀 Project: LongVIL-Agent


🔧 Environment Setup

Our experiments were conducted on:

  • Ubuntu 22.04.4 LTS
  • Python 3.10.14
  • PyTorch 2.4.0+cu121
  • NVIDIA GeForce RTX 4090D GPU

Create the environment from the provided environment.yml:

conda env create -f environment.yml -n longvil
conda activate longvil
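After activating the environment, a quick sanity check can confirm that the interpreter and PyTorch roughly match the versions listed above. This small script is our own addition (not part of the repository):

```python
# Sanity-check the active environment against the versions listed above
# (Python 3.10, PyTorch 2.4 with CUDA). Not part of the repo; illustrative only.
import sys

def check_python(expected_major=3, expected_minor=10):
    """Return True if the interpreter matches the expected major.minor version."""
    return sys.version_info[:2] == (expected_major, expected_minor)

if __name__ == "__main__":
    print("Python 3.10:", check_python())
    try:
        import torch  # only available once the longvil env is active
        print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("PyTorch not installed; activate the longvil env first.")
```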

🎯 Keyframe Extraction (based on SeeDo)

We adopt and extend the keyframe extraction module from SeeDo.
This step must be done before running the benchmark.

  • The original SeeDo extracts keyframes for either the left or the right hand.
  • Our improved version extracts keyframes for both hands independently.

We provide a modified script (get_frames.py). To use it:

  1. Install GroundingDINO, SAM, and SAM2 as in SeeDo.
  2. Replace the original SeeDo script (get_frame_by_hands.py) with ours (get_frames.py).
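To illustrate what "both hands independently" means, here is a minimal sketch (not the actual get_frames.py logic): given per-frame hand centers for each hand, candidate keyframes are frames where that hand nearly stops moving. The `traces` structure and the speed threshold are our own illustrative assumptions; in a SeeDo-style pipeline, hand positions would come from a detector such as GroundingDINO.

```python
# Illustrative sketch only (not the actual get_frames.py): select keyframes
# independently per hand. `traces` maps "left"/"right" to (x, y) hand centers
# per frame; a frame is a candidate keyframe when the hand nearly stops
# moving (a plausible grasp/release moment).

def keyframes_per_hand(traces, speed_thresh=2.0):
    """Return {hand: [frame indices where the hand's speed drops below threshold]}."""
    result = {}
    for hand, pts in traces.items():
        frames = []
        for i in range(1, len(pts)):
            dx = pts[i][0] - pts[i - 1][0]
            dy = pts[i][1] - pts[i - 1][1]
            if (dx * dx + dy * dy) ** 0.5 < speed_thresh:
                frames.append(i)
        result[hand] = frames
    return result
```

Because each hand's trace is processed separately, a pause of one hand never suppresses or creates keyframes for the other.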

📂 Dataset Preparation

Download LongVILBench.

Place the dataset under:

LongVIL/data
├── level1/
├── level2/
└── level3/

⚠️ The downloaded dataset folders (level1, level2, and level3) should replace the example folder inside LongVIL/data/.
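A small helper can verify that the dataset landed in the expected layout before running anything. This check is our own addition, assuming only the directory tree shown above:

```python
# Verify that LongVILBench is laid out as shown above (our own helper,
# not part of the repo).
from pathlib import Path

def check_layout(root="data", levels=("level1", "level2", "level3")):
    """Return the list of missing level folders (empty list = layout is OK)."""
    root = Path(root)
    return [lvl for lvl in levels if not (root / lvl).is_dir()]
```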


🚀 Usage

Training & Inference

Run:

bash run.sh

⚠️ Edit run.sh to add your API key and specify the benchmark level.


📊 Evaluation

Run:

python evaluate.py

Edit the following lines in evaluate.py to match the target level:

OUTPUT_ROOT = Path("./output/data/level2")
GT_ROOT = Path("./data/level2")
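Rather than editing the two constants by hand for each level, they can be derived from a single level number. The wrapper function below is our own illustrative addition; only the constant names and path patterns come from the snippet above:

```python
# Derive evaluate.py's two paths from a chosen benchmark level (1, 2, or 3)
# instead of editing the constants by hand. Illustrative wrapper; the
# constant names and path patterns follow the lines above.
from pathlib import Path

def roots_for_level(level: int):
    """Return (OUTPUT_ROOT, GT_ROOT) for the given benchmark level."""
    return Path(f"./output/data/level{level}"), Path(f"./data/level{level}")

OUTPUT_ROOT, GT_ROOT = roots_for_level(2)  # same values as the lines above
```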

Metrics

We provide three metrics:

  1. Exact Match Accuracy (EMA) – sequence match
  2. Step-wise Matching Score (SMS) – prefix match
  3. Final State Accuracy (FSA) – final state correctness
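The three metrics can be sketched as follows for a predicted vs. ground-truth action sequence. These are simplified, plausible versions written for illustration; the exact definitions are the ones implemented in evaluate.py and described in the paper.

```python
# Simplified sketches of the three metrics (the authoritative definitions
# live in evaluate.py and the paper).

def exact_match(pred, gt):
    """EMA sketch: 1.0 iff the whole predicted sequence equals the ground truth."""
    return float(pred == gt)

def stepwise_score(pred, gt):
    """SMS sketch: length of the longest matching prefix, normalized by |gt|."""
    n = 0
    for p, g in zip(pred, gt):
        if p != g:
            break
        n += 1
    return n / max(len(gt), 1)

def final_state_acc(pred_state, gt_state):
    """FSA sketch: 1.0 iff the final world states coincide."""
    return float(pred_state == gt_state)
```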

📖 Citation

If you use this work, please cite:

@misc{chen2025longhorizonvisualimitationlearning,
  title        = {Long-Horizon Visual Imitation Learning via Plan and Code Reflection},
  author       = {Quan Chen and Chenrui Shi and Qi Chen and Yuwei Wu and Zhi Gao and Xintong Zhang and Rui Gao and Kun Wu and Yunde Jia},
  year         = {2025},
  eprint       = {2509.05368},
  archivePrefix= {arXiv},
  primaryClass = {cs.RO},
  url          = {https://arxiv.org/abs/2509.05368}
}

📜 License

This project is released under the Apache-2.0 License.
