chenquan2002/LongVIL


LongVIL: Long-Horizon Visual Imitation Learning via Plan and Code Reflection

Authors
Quan Chen1,2★, Chenrui Shi1★, Qi Chen1,2, Yuwei Wu1,
Zhi Gao1, Xintong Zhang1, Rui Gao1,2, Kun Wu3, Yunde Jia2

★ Equal contribution

Affiliations
1 Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology

2 Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University

3 Beijing Innovation Center of Humanoid Robotics

📄 Paper: Long-Horizon Visual Imitation Learning via Plan and Code Reflection
📊 Dataset: LongVILBench
🚀 Project: LongVIL-Agent


🔧 Environment Setup

Our experiments were conducted on:

  • Ubuntu 22.04.4 LTS
  • Python 3.10.14
  • PyTorch 2.4.0+cu121
  • NVIDIA GeForce RTX 4090D GPU

Create the environment from the provided environment.yml:

conda env create -f environment.yml -n longvil
conda activate longvil
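After activating the environment, a quick sanity check can confirm that the interpreter and PyTorch roughly match the versions listed above. This small script is our own addition (not part of the repository):

```python
# Sanity-check the active environment against the versions listed above
# (Python 3.10, PyTorch 2.4 with CUDA). Not part of the repo; illustrative only.
import sys

def check_python(expected_major=3, expected_minor=10):
    """Return True if the interpreter matches the expected major.minor version."""
    return sys.version_info[:2] == (expected_major, expected_minor)

if __name__ == "__main__":
    print("Python 3.10:", check_python())
    try:
        import torch  # only available once the longvil env is active
        print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("PyTorch not installed; activate the longvil env first.")
```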

🎯 Keyframe Extraction (based on SeeDo)

We adopt and extend the keyframe extraction module from SeeDo.
This step must be done before running the benchmark.

  • The original SeeDo extracts keyframes for either the left or the right hand.
  • Our improved version extracts keyframes for both hands independently.

We provide a modified script (get_frames.py). To use it:

  1. Install GroundingDINO, SAM, and SAM2 as in SeeDo.
  2. Replace the original SeeDo script (get_frame_by_hands.py) with ours (get_frames.py).
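To illustrate what "both hands independently" means, here is a minimal sketch (not the actual get_frames.py logic): given per-frame hand centers for each hand, candidate keyframes are frames where that hand nearly stops moving. The `traces` structure and the speed threshold are our own illustrative assumptions; in a SeeDo-style pipeline, hand positions would come from a detector such as GroundingDINO.

```python
# Illustrative sketch only (not the actual get_frames.py): select keyframes
# independently per hand. `traces` maps "left"/"right" to (x, y) hand centers
# per frame; a frame is a candidate keyframe when the hand nearly stops
# moving (a plausible grasp/release moment).

def keyframes_per_hand(traces, speed_thresh=2.0):
    """Return {hand: [frame indices where the hand's speed drops below threshold]}."""
    result = {}
    for hand, pts in traces.items():
        frames = []
        for i in range(1, len(pts)):
            dx = pts[i][0] - pts[i - 1][0]
            dy = pts[i][1] - pts[i - 1][1]
            if (dx * dx + dy * dy) ** 0.5 < speed_thresh:
                frames.append(i)
        result[hand] = frames
    return result
```

Because each hand's trace is processed separately, a pause of one hand never suppresses or creates keyframes for the other.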

📂 Dataset Preparation

Download LongVILBench.

Place the dataset under:

LongVIL/data
├── level1/
├── level2/
└── level3/

⚠️ The downloaded dataset folders (level1, level2, and level3) should replace the example folder inside LongVIL/data/.
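A small helper can verify that the dataset landed in the expected layout before running anything. This check is our own addition, assuming only the directory tree shown above:

```python
# Verify that LongVILBench is laid out as shown above (our own helper,
# not part of the repo).
from pathlib import Path

def check_layout(root="data", levels=("level1", "level2", "level3")):
    """Return the list of missing level folders (empty list = layout is OK)."""
    root = Path(root)
    return [lvl for lvl in levels if not (root / lvl).is_dir()]
```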


🚀 Usage

Training & Inference

Run:

bash run.sh

⚠️ Edit run.sh to add your API key and specify the benchmark level.


📊 Evaluation

Run:

python evaluate.py

Edit the following lines in evaluate.py to match the target level:

OUTPUT_ROOT = Path("./output/data/level2")
GT_ROOT = Path("./data/level2")
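Rather than editing the two constants by hand for each level, they can be derived from a single level number. The wrapper function below is our own illustrative addition; only the constant names and path patterns come from the snippet above:

```python
# Derive evaluate.py's two paths from a chosen benchmark level (1, 2, or 3)
# instead of editing the constants by hand. Illustrative wrapper; the
# constant names and path patterns follow the lines above.
from pathlib import Path

def roots_for_level(level: int):
    """Return (OUTPUT_ROOT, GT_ROOT) for the given benchmark level."""
    return Path(f"./output/data/level{level}"), Path(f"./data/level{level}")

OUTPUT_ROOT, GT_ROOT = roots_for_level(2)  # same values as the lines above
```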

Metrics

We provide three metrics:

  1. Exact Match Accuracy (EMA) – sequence match
  2. Step-wise Matching Score (SMS) – prefix match
  3. Final State Accuracy (FSA) – final state correctness
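The three metrics can be sketched as follows for a predicted vs. ground-truth action sequence. These are simplified, plausible versions written for illustration; the exact definitions are the ones implemented in evaluate.py and described in the paper.

```python
# Simplified sketches of the three metrics (the authoritative definitions
# live in evaluate.py and the paper).

def exact_match(pred, gt):
    """EMA sketch: 1.0 iff the whole predicted sequence equals the ground truth."""
    return float(pred == gt)

def stepwise_score(pred, gt):
    """SMS sketch: length of the longest matching prefix, normalized by |gt|."""
    n = 0
    for p, g in zip(pred, gt):
        if p != g:
            break
        n += 1
    return n / max(len(gt), 1)

def final_state_acc(pred_state, gt_state):
    """FSA sketch: 1.0 iff the final world states coincide."""
    return float(pred_state == gt_state)
```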

📖 Citation

If you use this work, please cite:

@misc{chen2025longhorizonvisualimitationlearning,
  title        = {Long-Horizon Visual Imitation Learning via Plan and Code Reflection},
  author       = {Quan Chen and Chenrui Shi and Qi Chen and Yuwei Wu and Zhi Gao and Xintong Zhang and Rui Gao and Kun Wu and Yunde Jia},
  year         = {2025},
  eprint       = {2509.05368},
  archivePrefix= {arXiv},
  primaryClass = {cs.RO},
  url          = {https://arxiv.org/abs/2509.05368}
}

📜 License

This project is released under the Apache-2.0 License.
