
VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback


This repo contains the code implementation for VLA-Touch:

 

VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
Jianxin Bi 1, Kevin Yuchen Ma 1, Ce Hao 1, Mike Zheng Shou 1, Harold Soh 1,2
1Dept. of Computer Science, National University of Singapore
2Smart Systems Institute, NUS

 

[Arxiv] [Project Page] [Video]

🧾 Introduction

We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline in which a pretrained tactile-language model provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision.

VLA-Touch Framework

Figure 1: Dual-level tactile feedback framework of VLA-Touch. Planning: Given a scene image $s_t$ and task goal $g$, the VLM task planner generates a manipulation instruction $I_k$ for policy execution. A tactile-language model (Octopi) converts a sequence of tactile inputs $o^m_{t-n:t}$ into a language description $L^m_t$, which informs the VLM's updated instruction. Manipulation: The base VLA $\pi(a_t|s_t,I_k)$ generates an action chunk $a_t$ from the visual observation $s_t$ and instruction $I_k$. The action chunk is then refined by an interpolant policy $\pi_I(\hat a_t|s_t,a_t,m_t)$ that takes as input both visual embeddings from a pretrained DINOv2 model and low-dimensional tactile signals $m_t$, extracted from the raw tactile input $o^m_t$ by a marker-tracking algorithm.
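The marker-tracking step in the caption reduces raw tactile images to a low-dimensional signal $m_t$. Below is a minimal, illustrative sketch of one such reduction; the function name, the shear-plus-magnitude encoding, and the 3-D output layout are our assumptions, not the repo's actual pipeline:

```python
import numpy as np

def tactile_signal_from_markers(prev_pts: np.ndarray, curr_pts: np.ndarray) -> np.ndarray:
    """Reduce tracked marker positions, shape (N, 2), to a low-dimensional
    contact signal: mean shear displacement (2,) plus mean magnitude (1,)."""
    disp = curr_pts - prev_pts                              # per-marker displacement
    mean_shear = disp.mean(axis=0)                          # average lateral motion
    magnitude = np.linalg.norm(disp, axis=1).mean(keepdims=True)  # average slip size
    return np.concatenate([mean_shear, magnitude])          # shape (3,)
```

In practice the signal fed to the interpolant policy may carry more channels (e.g. per-region shear), but the idea is the same: a compact summary of marker motion rather than the raw tactile image.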

💻 Installation

  1. Follow the RDT-1B installation guide
    See the official instructions: RDT-1B installation

  2. Clone VLA-Touch and copy files to RDT-1B (replace original files):

    git clone https://github.com/jxbi1010/VLA-Touch
    # Copy relevant files to your RDT-1B directory, replacing originals as needed
  3. Download the dataset and controller checkpoints from the Google Drive Folder, or get the processed dataset from Hugging Face

    • Copy controller checkpoints:
      cp controller_ckpt/* VLA/residual_controller/checkpoints/
  4. Dataset processing (for reference):

    • Copy dataset files:

      cp vla_data/* VLA/data/datasets/
    • Convert raw data to .h5 format:

      # Run the provided scripts to convert raw data
      cd VLA/data/franka_data
      python convert*_to_h5.py  # Replace with actual processing scripts
      # The resulting files should look like: vla_data/wipe_example/episode_*.h5
    • If you need our processed dataset, please contact us.
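The conversion step above produces per-episode `.h5` files. A minimal sketch of writing one episode with `h5py` follows; the dataset keys (`obs`, `action`, `tactile`) and the function name are assumptions for illustration, not the repo's actual schema:

```python
import h5py
import numpy as np

def save_episode_h5(path, observations, actions, tactile):
    """Write one episode to an .h5 file with one dataset per modality.
    Keys here are illustrative, not the repo's actual schema."""
    with h5py.File(path, "w") as f:
        f.create_dataset("obs", data=np.asarray(observations), compression="gzip")
        f.create_dataset("action", data=np.asarray(actions), compression="gzip")
        f.create_dataset("tactile", data=np.asarray(tactile), compression="gzip")
```

Keeping each modality in its own dataset lets the dataloader read actions and tactile signals without decoding the full observation stream.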

  5. Compute dataset stats and update configs:

    # Use RDT scripts to compute dataset statistics
    python compute_dataset_stats.py 
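Dataset statistics of this kind are typically per-dimension min/max/mean/std over all action steps, used to normalize the action space before training. A hedged sketch (the function and the exact set of statistics are assumptions; follow the RDT scripts for the actual config format):

```python
import numpy as np

def dataset_stats(episodes):
    """Per-dimension statistics over all action steps, given a list of
    (T_i, D) action arrays. Useful for normalizing the action space."""
    actions = np.concatenate([np.asarray(e) for e in episodes], axis=0)
    return {
        "min": actions.min(axis=0),
        "max": actions.max(axis=0),
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0),
    }
```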
  6. Install Octopi:
    Follow the instructions in:

    octopi/README.md
    
  7. Copy Octopi data files:

    # Download from Google Drive and copy to the correct location
    cp octopi_data/* octopi/octopi_s/data/

    Google Drive Folder for Octopi Data

🛠️ Usage

  1. Follow the RDT-1B instructions to fine-tune the VLA base model (without tactile data).
  2. Run the scripts in residual_controller/ for controller training and testing, e.g.
# training for the interpolant controller
python bridge_train.py

# testing for the interpolant controller
python bridger_test.py

# training for the residual controller
python lstm_train.py

# testing for the residual controller
python lstm_step_test.py
  3. For Octopi inference, run octopi/octopi_s/touch_vla.py using your own VLM API.
  4. The inference method is modified from the RDT inference script; our version will be released soon.
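Putting the two levels together at inference time: the frozen VLA proposes an action chunk, and the interpolant controller transports it toward a tactile-aware chunk over a few steps. The sketch below uses hypothetical interfaces throughout (`base_policy`, `refiner`, and the uniform step schedule are all assumptions, not the repo's released inference code):

```python
import numpy as np

def refine_chunk(base_policy, refiner, obs, instruction, tactile, n_steps=5):
    """One control step of dual-level inference: the frozen base VLA proposes
    an action chunk, then the interpolant controller iteratively refines it
    using the tactile signal. All interfaces here are illustrative."""
    chunk = base_policy(obs, instruction)            # (H, D) chunk from the base VLA
    for k in range(n_steps):
        tau = (k + 1) / n_steps                      # interpolation time in (0, 1]
        chunk = refiner(chunk, obs, tactile, tau)    # move chunk toward refined actions
    return chunk
```

Because the base VLA is never fine-tuned, only the lightweight refiner needs to see tactile data, which is the point of the dual-level design.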

📝 Citation

If you find our work useful, please consider citing:

@misc{bi2025vlatouchenhancingvisionlanguageactionmodels,
      title={VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback}, 
      author={Jianxin Bi and Kevin Yuchen Ma and Ce Hao and Mike Zheng Shou and Harold Soh},
      year={2025},
      eprint={2507.17294},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.17294}, 
}



🏷️ License

VLA-Touch is licensed under the MIT license. See the LICENSE file for details.



🙏 Acknowledgement

VLA-Touch is built on many open-source works, including BRIDGeR, Octopi, and RDT-1B. We thank the authors for open-sourcing their code and for their great contributions to the community.
