GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation
Teli Ma1*, Jia Zheng1*, Zifan Wang1, Ziyao Gao1, Jiaming Zhou1, Junwei Liang1,2,#
*equal contributions, #corresponding author
1HKUST(GZ), 2HKUST
[🌐 Project Page] | [📄 GLOVER++ Paper] | [📄 GLOVER Paper] | [🤗 Huggingface Data] | [📺 Video] | [🤗 Pretrained Weights]
- GLOVER++ distills actionable affordance knowledge from rich human videos and demonstrates effective transfer of this knowledge, as an explicit representation, to a variety of manipulation tasks.
- We contribute HOVA-500K, a large-scale affordance-annotated dataset that provides the scale and diversity needed to learn generalizable affordance representations.
- We present GLOVER++, a global-to-local affordance training policy built on HOVA-500K that yields fine-grained affordance representations and generalizable affordance reasoning. GLOVER++ achieves state-of-the-art performance on the HOVA-500K evaluation benchmark.
- Extensive applications, including zero-shot manipulation, multi-task imitation learning, and long-horizon and bimanual manipulation, demonstrate the strong potential of HOVA-500K and GLOVER++.
- We introduce HOVA-500K, a large-scale affordance-annotated dataset constructed from existing human videos and images. HOVA-500K comprises 500,000 meticulously annotated images spanning 1,726 object categories and 675 action categories, forming a comprehensive taxonomy of human-object interactions.
- Download the HOVA-500K dataset, then use the following commands to merge each dataset's splits into a single .tar.gz file:
cat HANDAL/part_* > HANDAL.tar.gz
cat Ego4D/part_* > Ego4D.tar.gz
cat epic-100/part_* > epic-100.tar.gz
- Uncompress these .tar.gz files and organize them as follows (a hedged extraction sketch follows the note below):
├── HOVA-500K
│   ├── 3doi
│   │   ├── GT_gaussian
│   │   └── images
│   ├── Ego4D
│   │   ├── GT_gaussian
│   │   └── frames
│   ├── HANDAL
│   │   ├── annotations
│   │   │   ├── GT_gaussian_train
│   │   │   └── GT_gaussian_test
│   │   └── images
│   └── epic-100
Note: the "annotations" files should be put in the same directory as the training code.
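A hedged sketch of the uncompress step, assuming the archives were merged as above and everything should land under a HOVA-500K/ directory (whether each archive already contains its top-level folder depends on the release, so adjust the target paths accordingly):

```bash
# Sketch only: extract each merged archive into HOVA-500K/ (paths are illustrative).
mkdir -p HOVA-500K
tar -xzf HANDAL.tar.gz -C HOVA-500K/
tar -xzf Ego4D.tar.gz -C HOVA-500K/
tar -xzf epic-100.tar.gz -C HOVA-500K/
```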
- Clone the repository:
git clone https://github.com/TeleeMa/GLOVER.git
cd GLOVER
- Install dependencies (we use Python 3.9):
pip install -r requirements.txt
- Download pre-trained models (see the sketch below):
- LISA Plus 7B model
- CLIP ViT-L/14 model
- SAM ViT-H model
- Place them in the specified directories and configure the model paths in the training script.
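A hedged sketch of fetching the public checkpoints listed above (the CLIP and SAM sources below are the standard public releases; the LISA Plus 7B location depends on where you obtain it, so its path is a placeholder):

```bash
# CLIP ViT-L/14 from the standard OpenAI release on Hugging Face.
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./pretrained/clip-vit-large-patch14

# SAM ViT-H checkpoint from the official Segment Anything release.
wget -P ./pretrained https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# LISA Plus 7B: place your downloaded copy under ./pretrained/LISA_Plus_7b (placeholder path).
```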
Basic training command:
bash train_glover.sh
Or, for advanced training with GLOVER++:
bash train_glover_plus.sh
NOTE: Key training parameters must be set individually (see the sketch below):
- --version: /path/to/LISA_Plus_7b
- --vision-tower: /path/to/clip-vit-large-patch14
- --sam_vit_path: /path/to/sam_vit_h_4b8939.pth (only for GLOVER++)
- --dataset_dir: /path/to/HOVA-500K datasets
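For illustration only, and assuming train_glover_plus.sh forwards extra command-line flags to the underlying trainer (otherwise edit the corresponding variables inside the script), the paths might be supplied like this:

```bash
# Assumption: the wrapper script passes these flags through; all paths are placeholders.
bash train_glover_plus.sh \
  --version /path/to/LISA_Plus_7b \
  --vision-tower /path/to/clip-vit-large-patch14 \
  --sam_vit_path /path/to/sam_vit_h_4b8939.pth \
  --dataset_dir /path/to/HOVA-500K
```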
When training is finished, run the following to obtain the full model weights:
cd ./runs/glover(++)/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
Then merge the LoRA weights in pytorch_model.bin and save the resulting model, in Hugging Face format, to your desired path:
bash merge_weights.sh
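As an optional sanity check, and assuming the merged directory contains a standard Hugging Face config (a project-specific assumption), you can try loading its configuration:

```bash
# Optional check: the merged checkpoint directory (placeholder path) should parse as a Hugging Face model config.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('/path/to/merged_model', trust_remote_code=True))"
```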
Run evaluation on the HOVA-500K benchmark:
bash eval.sh
NOTE: Key evaluation parameters must be set individually (see the sketch below):
- --dataset_dir: /path/to/HOVA-500K datasets
- --version: /path/to/GLOVER(++) model
- --model_arch: Choose from 'glover' or 'glover++'
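For illustration, assuming eval.sh forwards these flags (otherwise set them inside the script; paths are placeholders):

```bash
# Assumption: the wrapper script passes these flags through; paths are placeholders.
bash eval.sh \
  --dataset_dir /path/to/HOVA-500K \
  --version /path/to/GLOVER_plus_model \
  --model_arch glover++
```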
Run inference on your own images:
bash infer.sh
NOTE: Key inference parameters must be set individually (see the sketch below):
- --version: Path to the GLOVER(++) model
- --model_arch: Choose from 'glover' or 'glover++'
- --image_path: Path to the input image
- --objects: Target objects (e.g., 'bottle,cup')
- --actions: Target actions (e.g., 'pick up,raise')
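For illustration, assuming infer.sh forwards these flags (otherwise set them inside the script; the model and image paths are placeholders, while the object and action strings follow the examples above):

```bash
# Assumption: the wrapper script passes these flags through; paths are placeholders.
bash infer.sh \
  --version /path/to/GLOVER_plus_model \
  --model_arch glover++ \
  --image_path /path/to/your_image.jpg \
  --objects 'bottle,cup' \
  --actions 'pick up,raise'
```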
If you find this project useful in your research, please consider citing:
@article{ma2025glover++,
  title={GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation},
  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Gao, Ziyao and Zhou, Jiaming and Liang, Junwei},
  journal={arXiv preprint arXiv:2505.11865},
  year={2025}
}

