Pixel LLMs: Pixel-Level Grounded Understanding for Multimodal LLMs

Pixel LLMs is a family of projects that bring pixel-level, dense grounded understanding to multimodal LLMs. It is anchored by Sa2VA — a unified model that marries SAM-2 with LLaVA for dense grounded understanding of images and videos — together with a growing set of research projects built on top of it.

Projects

🧠 Sa2VA — Marrying SAM2 with LLaVA

Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang

The core unified model: SAM-2 + MLLM for referring segmentation, grounded conversation, visual prompting, and image/video chat. Supports InternVL2.5/3 and Qwen2.5-VL/Qwen3-VL backbones.

📂 projects/sa2va · 📜 arXiv · 🏠 Page · 🤗 Models

🔍 VRT — Visual Reasoning Tracer

Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

Object-level grounded reasoning built on Sa2VA. Ships VRT-Bench (evaluation) and VRT-80k (training data).

📂 projects/vrt_sa2va · 📜 arXiv · 🏠 Page · 🤗 Data

🧩 SAMTok — Representing Any Mask with Two Words (CVPR 2026)

Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li

A unified mask-token interface that lets any MLLM generate and understand masks.

📂 projects/samtok · 📜 arXiv · 🏠 Page · 🤗 Models

Extensions

SaSaSa2VA — a segmentation-augmented extension of Sa2VA that won 1st place in the ICCV 2025 LSVOS Challenge RVOS Track 🏅.
Pixel-SAIL — single-transformer pixel-level grounding.

Environment

We manage dependencies with uv. Install it once:

curl -LsSf https://astral.sh/uv/install.sh | sh

The environment is defined under projects/sa2va (pyproject.toml + uv.lock) and shared across the projects. The quickest way to set it up — with the virtualenv placed in /tmp and symlinked back into the project — is the helper script at the repo root:

bash setup_env.sh                 # projects/sa2va, --extra=latest
# bash setup_env.sh sa2va legacy  # InternVL2.5 or earlier

Or do it manually:

cd projects/sa2va
uv sync --extra=latest            # or --extra=legacy

Then run training / evaluation from the repository root with the environment activated (source projects/sa2va/.venv/bin/activate). See each project's README for project-specific steps.
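
Because commands are meant to run with the environment activated, a quick sanity check can confirm that the interpreter is actually inside a virtual environment. The sketch below is illustrative and not part of this repository: in_virtualenv is a hypothetical helper name, and it relies only on the standard comparison of sys.prefix against sys.base_prefix.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment.

    Hypothetical helper for illustration; not shipped with this repository.
    """
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line reports False, the environment is probably not activated in the current shell.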

For tokens / API keys (HuggingFace, OpenRouter), copy the template and fill it in — setup_env.sh loads it automatically:

cp .env.example .env   # then edit .env
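
For readers curious what loading a .env file involves, here is a minimal Python sketch that parses KEY=VALUE lines into a dictionary. It is an illustration under stated assumptions, not the logic used by setup_env.sh (which is not shown here): load_env_file is a hypothetical name, and the handling of comments, an optional export prefix, and surrounding quotes is a deliberate simplification.

```python
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Hypothetical helper for illustration; the repository's own loader
    lives in setup_env.sh and may behave differently.
    """
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ".
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line, ignore it
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values
```

Calling os.environ.update(load_env_file(".env")) from the repository root (after import os) would then expose the parsed values to the current Python process.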

Why uv? It treats the environment as code: direct dependencies are declared in pyproject.toml, and every transitive package is pinned to an exact version in uv.lock. Together they form a single source of truth, so one uv sync recreates the same environment on any machine and avoids the drift that builds up from manual pip install commands.

Citation

If you find this repository useful, please consider citing the relevant papers:

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Sun, Yueyi and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv pre-print},
  year={2025}
}

@article{yuan2025vrt,
  title={Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark},
  author={Yuan, Haobo and Sun, Yueyi and Li, Yanwei and Zhang, Tao and Deng, Xueqing and Ding, Henghui and Qi, Lu and Wang, Anran and Li, Xiangtai and Yang, Ming-Hsuan},
  journal={arXiv pre-print},
  year={2025}
}

@inproceedings{zhou2026samtok,
  title={SAMTok: Representing Any Mask with Two Words},
  author={Zhou, Yikang and Zhang, Tao and Gong, Dengxian and Wu, Yuanzheng and Tian, Ye and Wang, Haochen and Yuan, Haobo and Wang, Jiacong and Qi, Lu and Fei, Hao and Wang, Anran and Wang, Zhuochen and Wang, Yujing and Chen, Cheng and Ji, Shunping and Li, Xiangtai},
  booktitle={CVPR},
  address={Denver, CO, USA},
  year={2026}
}

