
TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models


Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
University of Illinois at Urbana-Champaign


📢 News

  • [2025-11-03]: Proud to share that TextRegion was accepted to Transactions on Machine Learning Research (TMLR) and received a J2C Certification — top ~10% among accepted papers!

  • [2025-05-29]: We released the paper and code for TextRegion!

🧠 Overview

TextRegion is a training-free framework that generates text-aligned region tokens by combining frozen image-text models (e.g., CLIP, SigLIP2, Perception Encoder) with segmentation masks from SAM2. These region tokens enable strong zero-shot performance on detailed visual understanding tasks such as:

  • Open-world semantic segmentation
  • Referring expression comprehension
  • Multi-object grounding

“A simple, general, effective, and training-free approach to create text-compatible region tokens.”
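The core idea, pooling a frozen model's patch features within each SAM2 mask and matching the resulting region tokens against text embeddings, can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names, shapes, and the simple mean-pooling step are assumptions for clarity.

```python
import numpy as np


def region_tokens(patch_feats, masks):
    """Pool frozen patch features inside each segmentation mask.

    patch_feats: (N, D) patch embeddings from a frozen image-text model
                 (N = number of patches, D = embedding dim).
    masks:       (M, N) boolean segmentation masks (e.g., from SAM2).
    Returns:     (M, D) L2-normalized region tokens.

    NOTE: simple mean pooling is an illustrative assumption here.
    """
    tokens = masks.astype(float) @ patch_feats          # sum of features per region
    tokens /= masks.sum(axis=1, keepdims=True)          # mean over patches in the mask
    return tokens / np.linalg.norm(tokens, axis=1, keepdims=True)


def label_regions(tokens, text_embs):
    """Assign each region the text label with the highest cosine similarity.

    tokens:    (M, D) unit-norm region tokens.
    text_embs: (K, D) unit-norm text embeddings of the candidate labels.
    Returns:   (M,) index of the best-matching label per region.
    """
    sims = tokens @ text_embs.T   # cosine similarity (both sides unit-norm)
    return sims.argmax(axis=1)
```

Because both the image and text encoders stay frozen, the region tokens live in the model's original joint embedding space, which is what makes the zero-shot text matching possible without any training.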

📦 Installation

git clone https://github.com/avaxiao/TextRegion.git
cd TextRegion
conda create -n TextRegion python=3.10 -y
conda activate TextRegion
bash setup_env.sh

🚀 Demo

Before running the demo, download the sam2.1_hiera_large.pt checkpoint via the link provided in SAM2's repo.

After configuring --sam2_checkpoint and --clip_download_root in TextRegionSegmenter.py, you can run the demo directly:

python TextRegionSegmenter.py

To use a different image-text model, update --clip_pretrained and --clip_architecture accordingly.

To run inference on a custom image, edit the ./utils/image_query_label.yaml file and set --image_list in TextRegionSegmenter.py to your image path.

📊 Evaluation

1. Open-World Semantic Segmentation

Preparing Data

Please follow the MMSeg data preparation document to download and pre-process the datasets including PASCAL VOC, PASCAL Context, Cityscapes, ADE20k, COCO Object and COCO-Stuff164k. We provide some dataset processing scripts in ./process_dataset.sh.

Evaluation

Please modify the settings sam2_checkpoint and clip_download_root in configs/base_config.py. You also need to change data_root in the corresponding dataset configuration files in configs/cfg_ds. Then you can evaluate a specific dataset with:

python eval_semantic.py --config ./configs/cfg_ds/cfg_DATASET.py --work-dir YOUR_WORK_DIR

or evaluate all datasets at once:

python eval_all_semantic.py

Results are saved to YOUR_WORK_DIR.

2. Referring Expression Comprehension

1. Download the images for RefCOCO, then unzip them with unzip train2014.zip. Put the unzipped dataset (train2014) in ./eval/datasets/coco_rec/.

2. Download the preprocessed data files reclip_data.tar.gz, then extract them with tar -xvzf reclip_data.tar.gz. Put the extracted data in ./eval/datasets/coco_rec/.

3. Modify --sam2_checkpoint and --clip_download_root in TextRegionSegmenter.py.

4. Run the evaluation script:

python eval_referring.py --input_file_root ./eval/datasets/coco_rec/reclip_data --image_root ./eval/datasets/coco_rec/train2014

3. Multi-Object Grounding

1. Download the Reasoning Segmentation test dataset and unzip test.zip to ./eval/datasets/reason_seg/.

2. Download the interpreted query file test.tar.gz, extract it with tar -xvzf test.tar.gz, and put it in ./eval/datasets/reason_seg/interpreted_llava_v15_7b/.

3. Modify --sam2_checkpoint and --clip_download_root in TextRegionSegmenter.py.

4. Run the evaluation script:

python eval_reason_seg.py --dataset_dir ./eval/datasets --interpreted_query_dir ./eval/datasets/reason_seg/interpreted_llava_v15_7b/test

📈 Citation

If you find TextRegion useful, please consider citing:

@article{xiao2025textregion,
  title={TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models},
  author={Xiao, Yao and Fu, Qiqian and Tao, Heyi and Wu, Yuqun and Zhu, Zhen and Hoiem, Derek},
  journal={arXiv preprint arXiv:2505.23769},
  year={2025}
}

📝 Acknowledgements

This work is built upon SAM2, Trident, SCLIP, OpenCLIP, ReCLIP, and LISA. Thanks for their excellent work.

License

This project is released under the MIT License.

Note that some of the software downloaded and installed for this project is subject to separate copyright notices and license terms; your use of that software is governed by the terms and conditions under which it is made available.
