
InternVLA-A1: Unifying Understanding, Generation, and Action for Robotic Manipulation


Paper | Data | Data | Website

🔥 Highlights

InternVLA-A1 unifies scene understanding, visual foresight generation, and action execution into a single framework.

  • 🔮 The Core: Synergizes the MLLM's semantic understanding with world-model-style dynamic prediction, enabling the model to "imagine" the future and guide adaptive actions.
  • 🚀 The Fuel: Empowered by high-fidelity synthetic data (InternData-A1).
  • The Output: Tackles highly dynamic scenarios with effortless mastery.
Demo videos: express_sorting.mp4 · parcel_handling.mp4 · Overcooked.mp4 · sort_parts.mp4 · zig_bag.mp4 · unscrew_cap.mp4

📅 TODO List

  • Release InternVLA-A1-3B
  • Add quick-start for fine-tuning on lerobot/pusht
  • 🔥NEW!!! Release guidelines for large-scale dataset pretraining under "tutorials"
  • Release InternVLA-A1-2B


🛠️ Installation

This repository has been tested on Python 3.10 and CUDA 12.8. We recommend using conda to create an isolated environment.

1. Create Conda Environment

conda create -y -n internvla_a1 python=3.10
conda activate internvla_a1

pip install --upgrade pip

2. Install System Dependencies

We use FFmpeg for video encoding/decoding and SVT-AV1 for efficient storage.

conda install -c conda-forge ffmpeg=7.1.1 svt-av1 -y

3. Install PyTorch (CUDA 12.8)

pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
  --index-url https://download.pytorch.org/whl/cu128
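
Optionally, verify that the CUDA-enabled build is active before continuing (a minimal check, assuming the conda environment is activated):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

This should report torch 2.7.1 with CUDA 12.8 and True on a machine with a visible GPU.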

4. Install Python Dependencies

pip install torchcodec numpy scipy transformers==4.57.1 mediapy loguru pytest omegaconf
pip install -e .
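
As a quick sanity check (optional, assuming the editable install exposes the lerobot package from src/):

python -c "import lerobot, transformers; print(transformers.__version__)"  # expect 4.57.1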

5. Patch HuggingFace Transformers

We replace the default implementations of several model modules (e.g., π0, InternVLA_A1_3B, InternVLA_A1_2B) to support custom architectures for robot learning.

TRANSFORMERS_DIR=${CONDA_PREFIX}/lib/python3.10/site-packages/transformers/

cp -r src/lerobot/policies/pi0/transformers_replace/models        ${TRANSFORMERS_DIR}
cp -r src/lerobot/policies/InternVLA_A1_3B/transformers_replace/models  ${TRANSFORMERS_DIR}
cp -r src/lerobot/policies/InternVLA_A1_2B/transformers_replace/models  ${TRANSFORMERS_DIR}

Make sure the target directory exists—otherwise create it manually.
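
If the cp commands fail because ${TRANSFORMERS_DIR} does not resolve (for example, under a different Python layout), check the path and create it, then re-run the copies. A minimal sketch:

ls -d ${TRANSFORMERS_DIR} || mkdir -p ${TRANSFORMERS_DIR}
ls ${TRANSFORMERS_DIR}/models | head   # the patched model folders should appear here after copying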

6. Configure Environment Variables

export HF_TOKEN=your_token  # for downloading hf models, tokenizers, or processors
export HF_HOME=path_to_huggingface   # default: ~/.cache/huggingface
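
To make these settings persist across shells, you can optionally append them to your shell profile (adjust the placeholders to your setup):

echo 'export HF_TOKEN=your_token' >> ~/.bashrc
echo 'export HF_HOME=path_to_huggingface' >> ~/.bashrc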

7. Link Local HuggingFace Cache

ln -s ${HF_HOME}/lerobot data

This allows the repo to access datasets via ./data/.
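
You can confirm the link points to your cache (assuming HF_HOME is set as above):

ls -ld data   # should show: data -> ${HF_HOME}/lerobot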


🕹️ Playground

Quick start with lerobot/pusht

One-line command

bash launch/internvla_a1_3b_finetune.sh lerobot/pusht abs false

Here, abs indicates using absolute actions, and false means that the training script will use the statistics file (stats.json) provided by lerobot/pusht itself.


🎯 Fine-tuning

This section provides a tutorial for fine-tuning InternVLA-A1-3B on the InternData-A1 real-robot dataset: download a dataset → convert it to v3.0 format → fine-tune InternVLA-A1-3B on the A2D Pick-Pen task.


1. Prepare the post-training dataset

In this example, we use the A2D Pick-Pen task from the Genie-1 real-robot dataset.

Step 1.1 Download the dataset from Hugging Face

hf download \
  InternRobotics/InternData-A1 \
  real/genie1/Put_the_pen_from_the_table_into_the_pen_holder.tar.gz \
  --repo-type dataset \
  --local-dir data
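
After the download finishes, the archive should be preserved under the same relative path inside data/:

ls data/real/genie1/   # expect Put_the_pen_from_the_table_into_the_pen_holder.tar.gz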

Step 1.2 Extract and organize the dataset

Extract the downloaded archive, clean up intermediate files, and rename the dataset to follow the A2D naming convention:

tar -xzf data/real/genie1/Put_the_pen_from_the_table_into_the_pen_holder.tar.gz -C data

rm -rf data/real

mkdir -p data/v21
mv data/set_0 data/v21/a2d_pick_pen

After this step, the dataset directory structure should be:

data/
└── v21/
    └── a2d_pick_pen/
        ├── data/
        ├── meta/
        └── videos/
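
A quick sanity check of the layout (optional):

ls data/v21/a2d_pick_pen   # expect: data  meta  videos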

2. Convert the dataset from v2.1 to v3.0 format

The original dataset is stored in LeRobot v2.1 format, while this project requires LeRobot v3.0, so the dataset must be converted.

Run the following command to convert the dataset:

python src/lerobot/datasets/v30/convert_my_dataset_v21_to_v30.py \
    --old-repo-id v21/a2d_pick_pen \
    --new-repo-id v30/a2d_pick_pen

After conversion, the dataset will be available at:

data/v30/a2d_pick_pen/
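
If the conversion succeeded, the dataset metadata should report the new format version (a minimal check, assuming the converted dataset keeps the standard meta/info.json layout):

grep codebase_version data/v30/a2d_pick_pen/meta/info.json   # expect something like "codebase_version": "v3.0"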

3. Compute normalization statistics for relative actions (required)

This project fine-tunes policies using relative (delta) actions. Therefore, you must compute per-dataset normalization statistics (e.g., mean/std) for the action stream before training.

Run the following command to compute statistics for v30/a2d_pick_pen:

python util_scripts/compute_norm_stats_single.py \
  --action_mode delta \
  --chunk_size 50 \
  --repo_id v30/a2d_pick_pen

This script writes a stats.json file to ${HF_HOME}/lerobot/stats/delta/v30/a2d_pick_pen/stats.json.
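
You can verify the statistics file was written before moving on:

ls ${HF_HOME}/lerobot/stats/delta/v30/a2d_pick_pen/stats.json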


4. Fine-tune InternVLA-A1-3B on v30/a2d_pick_pen

One-line command

bash launch/internvla_a1_3b_finetune.sh v30/a2d_pick_pen delta true

v30/a2d_pick_pen specifies the dataset, delta indicates that relative (delta) actions are used, and true means that external normalization statistics are loaded instead of using the dataset’s built-in stats.json.

⚠️ Important Note

Before running launch/internvla_a1_3b_finetune.sh, make sure to replace the environment variables inside the script with your own settings (a sketch of typical overrides follows this list), including but not limited to:

  • HF_HOME
  • WANDB_API_KEY
  • CONDA_ROOT
  • CUDA / GPU-related environment variables
  • Paths to your local dataset and output directories
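
For reference, a minimal sketch of the kind of overrides expected at the top of the launch script (the exact variable names beyond HF_HOME, WANDB_API_KEY, and CONDA_ROOT are illustrative, not the script's actual contents):

export HF_HOME=/path/to/huggingface            # HuggingFace cache used throughout this repo
export WANDB_API_KEY=your_wandb_key            # experiment logging
export CONDA_ROOT=/path/to/miniconda3          # so the script can activate the right environment
export CUDA_VISIBLE_DEVICES=0,1,2,3            # GPUs used for training (illustrative)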

License and Citation

All code in this repository is released under CC BY-NC-SA 4.0. Please consider citing our project if it helps your research.

@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}

❤️ Acknowledgments
