
TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models


Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
University of Illinois at Urbana-Champaign


📢 News

  • [2025-11-03]: Proud to share that TextRegion was accepted to Transactions on Machine Learning Research (TMLR) and received a J2C Certification — top ~10% among accepted papers!

  • [2025-05-29]: We released the paper and code for TextRegion!

🧠 Overview

TextRegion is a training-free framework that generates text-aligned region tokens by combining frozen image-text models (e.g., CLIP, SigLIP2, Perception Encoder) with segmentation masks from SAM2. These region tokens enable strong zero-shot performance on detailed visual understanding tasks such as:

  • Open-world semantic segmentation
  • Referring expression comprehension
  • Multi-object grounding

“A simple, general, effective, and training-free approach to create text-compatible region tokens.”
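The core idea, pooling a frozen model's patch features within each SAM2 mask and matching the resulting region tokens against text embeddings, can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names, shapes, and the simple mean-pooling step are assumptions for clarity.

```python
import numpy as np


def region_tokens(patch_feats, masks):
    """Pool frozen patch features inside each segmentation mask.

    patch_feats: (N, D) patch embeddings from a frozen image-text model
                 (N = number of patches, D = embedding dim).
    masks:       (M, N) boolean segmentation masks (e.g., from SAM2).
    Returns:     (M, D) L2-normalized region tokens.

    NOTE: simple mean pooling is an illustrative assumption here.
    """
    tokens = masks.astype(float) @ patch_feats          # sum of features per region
    tokens /= masks.sum(axis=1, keepdims=True)          # mean over patches in the mask
    return tokens / np.linalg.norm(tokens, axis=1, keepdims=True)


def label_regions(tokens, text_embs):
    """Assign each region the text label with the highest cosine similarity.

    tokens:    (M, D) unit-norm region tokens.
    text_embs: (K, D) unit-norm text embeddings of the candidate labels.
    Returns:   (M,) index of the best-matching label per region.
    """
    sims = tokens @ text_embs.T   # cosine similarity (both sides unit-norm)
    return sims.argmax(axis=1)
```

Because both the image and text encoders stay frozen, the region tokens live in the model's original joint embedding space, which is what makes the zero-shot text matching possible without any training.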

📦 Installation

git clone https://github.com/avaxiao/TextRegion.git
cd TextRegion
conda create -n TextRegion python=3.10 -y
conda activate TextRegion
bash setup_env.sh

🚀 Demo

Before running the demo, download the sam2.1_hiera_large.pt checkpoint via the link provided in SAM2's repo.

After configuring --sam2_checkpoint and --clip_download_root in TextRegionSegmenter.py, you can run the demo directly:

python TextRegionSegmenter.py

To use a different image-text model, update --clip_pretrained and --clip_architecture accordingly.

To run inference on a custom image, edit the ./utils/image_query_label.yaml file and set --image_list in TextRegionSegmenter.py to your image path.

📊 Evaluation

1. Open-World Semantic Segmentation

Preparing Data

Please follow the MMSeg data preparation document to download and pre-process the datasets including PASCAL VOC, PASCAL Context, Cityscapes, ADE20k, COCO Object and COCO-Stuff164k. We provide some dataset processing scripts in ./process_dataset.sh.

Evaluation

Please modify the settings sam2_checkpoint and clip_download_root in configs/base_config.py. You also need to change data_root in the corresponding dataset configuration files in configs/cfg_ds. Then you can evaluate a specific dataset with:

python eval_semantic.py --config ./configs/cfg_ds/cfg_DATASET.py --work-dir YOUR_WORK_DIR

or evaluate all datasets at once:

python eval_all_semantic.py

Results are saved to YOUR_WORK_DIR.

2. Referring Expression Comprehension

1. Download the images for RefCOCO, then unzip them with unzip train2014.zip. Put the unzipped dataset (train2014) in ./eval/datasets/coco_rec/.

2. Download the preprocessed data files reclip_data.tar.gz, then extract them with tar -xvzf reclip_data.tar.gz. Put the extracted data in ./eval/datasets/coco_rec/.

3. Modify --sam2_checkpoint and --clip_download_root in TextRegionSegmenter.py.

4. Run the evaluation script:

python eval_referring.py --input_file_root ./eval/datasets/coco_rec/reclip_data --image_root ./eval/datasets/coco_rec/train2014

3. Multi-Object Grounding

1. Download the Reasoning Segmentation test dataset and unzip test.zip to ./eval/datasets/reason_seg/.

2. Download the interpreted query file test.tar.gz, extract it with tar -xvzf test.tar.gz, and put it in ./eval/datasets/reason_seg/interpreted_llava_v15_7b/.

3. Modify --sam2_checkpoint and --clip_download_root in TextRegionSegmenter.py.

4. Run the evaluation script:

python eval_reason_seg.py --dataset_dir ./eval/datasets --interpreted_query_dir ./eval/datasets/reason_seg/interpreted_llava_v15_7b/test

📈 Citation

If you find TextRegion useful, please consider citing:

@article{xiao2025textregion,
  title={TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models},
  author={Xiao, Yao and Fu, Qiqian and Tao, Heyi and Wu, Yuqun and Zhu, Zhen and Hoiem, Derek},
  journal={arXiv preprint arXiv:2505.23769},
  year={2025}
}

📝 Acknowledgements

This work is built upon SAM2, Trident, SCLIP, OpenCLIP, ReCLIP, and LISA. Thanks for their excellent work.

License

This project is released under the MIT License.

Note that some of the software downloaded and installed for this project is subject to separate copyright notices and license terms; your use of that software is governed by the terms and conditions under which it is made available.
