
VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback


This repo contains the code implementation for VLA-Touch:

 

VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
Jianxin Bi 1, Kevin Yuchen Ma 1, Ce Hao 1, Mike Zheng Shou 1, Harold Soh 1,2
1Dept. of Computer Science, National University of Singapore
2Smart Systems Institute, NUS

 

[Arxiv] [Project Page] [Video]

🧾 Introduction

We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline in which a pretrained tactile-language model provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision.

VLA-Touch Framework

Figure 1: Dual-level tactile feedback framework of VLA-Touch. Planning: Given a scene image $s_t$ and task goal $g$, the VLM task planner generates a manipulation instruction $I_k$ for policy execution. A tactile-language model (Octopi) converts a sequence of tactile inputs $o^m_{t-n:t}$ into a language description $L^m_t$, which informs the VLM's updated instruction. Manipulation: The base VLA $\pi(a_t|s_t,I_k)$ generates an action chunk $a_t$ from the visual observation $s_t$ and instruction $I_k$. The action chunk is then refined by an interpolant policy $\pi_I(\hat a_t|s_t,a_t,m_t)$ that takes as input both visual embeddings from a pretrained DINOv2 model and low-dimensional tactile signals $m_t$, extracted from the raw tactile input $o^m_t$ by a marker-tracking algorithm.
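The marker-tracking step in the caption reduces raw tactile images to a low-dimensional signal $m_t$. Below is a minimal, illustrative sketch of one such reduction; the function name, the shear-plus-magnitude encoding, and the 3-D output layout are our assumptions, not the repo's actual pipeline:

```python
import numpy as np

def tactile_signal_from_markers(prev_pts: np.ndarray, curr_pts: np.ndarray) -> np.ndarray:
    """Reduce tracked marker positions, shape (N, 2), to a low-dimensional
    contact signal: mean shear displacement (2,) plus mean magnitude (1,)."""
    disp = curr_pts - prev_pts                              # per-marker displacement
    mean_shear = disp.mean(axis=0)                          # average lateral motion
    magnitude = np.linalg.norm(disp, axis=1).mean(keepdims=True)  # average slip size
    return np.concatenate([mean_shear, magnitude])          # shape (3,)
```

In practice the signal fed to the interpolant policy may carry more channels (e.g. per-region shear), but the idea is the same: a compact summary of marker motion rather than the raw tactile image.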

💻 Installation

  1. Follow the RDT-1B installation guide
    See the official instructions: RDT-1B installation

  2. Clone VLA-Touch and copy files to RDT-1B (replace original files):

    git clone https://github.com/jxbi1010/VLA-Touch
    # Copy relevant files to your RDT-1B directory, replacing originals as needed
  3. Download the dataset and controller checkpoints from the Google Drive Folder, or get the processed dataset from Hugging Face

    • Copy controller checkpoints:
      cp controller_ckpt/* VLA/residual_controller/checkpoints/
  4. Dataset processing (for reference):

    • Copy dataset files:

      cp vla_data/* VLA/data/datasets/
    • Convert raw data to .h5 format:

      # Run the provided scripts to convert raw data
      cd VLA/data/franka_data
      python convert*_to_h5.py  # Replace with actual processing scripts
      # The resulting files should look like: vla_data/wipe_example/episode_*.h5
    • If you need our processed dataset, please contact us.
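The conversion step above produces per-episode `.h5` files. A minimal sketch of writing one episode with `h5py` follows; the dataset keys (`obs`, `action`, `tactile`) and the function name are assumptions for illustration, not the repo's actual schema:

```python
import h5py
import numpy as np

def save_episode_h5(path, observations, actions, tactile):
    """Write one episode to an .h5 file with one dataset per modality.
    Keys here are illustrative, not the repo's actual schema."""
    with h5py.File(path, "w") as f:
        f.create_dataset("obs", data=np.asarray(observations), compression="gzip")
        f.create_dataset("action", data=np.asarray(actions), compression="gzip")
        f.create_dataset("tactile", data=np.asarray(tactile), compression="gzip")
```

Keeping each modality in its own dataset lets the dataloader read actions and tactile signals without decoding the full observation stream.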

  5. Compute dataset stats and update configs:

    # Use RDT scripts to compute dataset statistics
    python compute_dataset_stats.py 
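Dataset statistics of this kind are typically per-dimension min/max/mean/std over all action steps, used to normalize the action space before training. A hedged sketch (the function and the exact set of statistics are assumptions; follow the RDT scripts for the actual config format):

```python
import numpy as np

def dataset_stats(episodes):
    """Per-dimension statistics over all action steps, given a list of
    (T_i, D) action arrays. Useful for normalizing the action space."""
    actions = np.concatenate([np.asarray(e) for e in episodes], axis=0)
    return {
        "min": actions.min(axis=0),
        "max": actions.max(axis=0),
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0),
    }
```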
  6. Install Octopi:
    Follow the instructions in:

    octopi/README.md
    
  7. Copy Octopi data files:

    # Download from Google Drive and copy to the correct location
    cp octopi_data/* octopi/octopi_s/data/

    Google Drive Folder for Octopi Data

🛠️ Usage

  1. Follow the RDT-1B instructions to fine-tune the VLA base model (without tactile data).
  2. Run the scripts in residual_controller/ for controller training and testing, e.g.
# training for the interpolant controller
python bridge_train.py

# testing for the interpolant controller
python bridger_test.py

# training for the residual controller
python lstm_train.py

# testing for the residual controller
python lstm_step_test.py
  3. For Octopi inference, run octopi/octopi_s/touch_vla.py using your own VLM API.
  4. The inference method is modified from the RDT inference script; our version will be released soon.
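Putting the two levels together at inference time: the frozen VLA proposes an action chunk, and the interpolant controller transports it toward a tactile-aware chunk over a few steps. The sketch below uses hypothetical interfaces throughout (`base_policy`, `refiner`, and the uniform step schedule are all assumptions, not the repo's released inference code):

```python
import numpy as np

def refine_chunk(base_policy, refiner, obs, instruction, tactile, n_steps=5):
    """One control step of dual-level inference: the frozen base VLA proposes
    an action chunk, then the interpolant controller iteratively refines it
    using the tactile signal. All interfaces here are illustrative."""
    chunk = base_policy(obs, instruction)            # (H, D) chunk from the base VLA
    for k in range(n_steps):
        tau = (k + 1) / n_steps                      # interpolation time in (0, 1]
        chunk = refiner(chunk, obs, tactile, tau)    # move chunk toward refined actions
    return chunk
```

Because the base VLA is never fine-tuned, only the lightweight refiner needs to see tactile data, which is the point of the dual-level design.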

📝 Citation

If you find our work useful, please consider citing:

@misc{bi2025vlatouchenhancingvisionlanguageactionmodels,
      title={VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback}, 
      author={Jianxin Bi and Kevin Yuchen Ma and Ce Hao and Mike Zheng Shou and Harold Soh},
      year={2025},
      eprint={2507.17294},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2507.17294}, 
}



🏷️ License

VLA-Touch is licensed under the MIT license. See the LICENSE file for details.



🙏 Acknowledgement

VLA-Touch is built on many open-source works, including BRIDGeR, Octopi, and RDT-1B. We thank the authors for open-sourcing their code and for their great contributions to the community.
