Pixel LLMs is a family of projects that bring pixel-level, dense grounded understanding to multimodal LLMs. It is anchored by Sa2VA — a unified model that marries SAM-2 with LLaVA for dense grounded understanding of images and videos — together with a growing set of research projects built on top of it.
🧠 Sa2VA — Marrying SAM2 with LLaVA
Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
The core unified model: SAM-2 + MLLM for referring segmentation, grounded conversation, visual prompting, and image/video chat. Supports InternVL2.5/3 and Qwen2.5-VL/Qwen3-VL backbones.
📂 projects/sa2va · 📜 arXiv · 🏠 Page · 🤗 Models
🔍 VRT — Visual Reasoning Tracer
Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang
Object-level grounded reasoning built on Sa2VA. Ships VRT-Bench (evaluation) and VRT-80k (training data).
📂 projects/vrt_sa2va · 📜 arXiv · 🏠 Page · 🤗 Data
🧩 SAMTok — Representing Any Mask with Two Words (CVPR 2026)
Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
A unified mask-token interface that lets any MLLM generate and understand masks.
📂 projects/samtok · 📜 arXiv · 🏠 Page · 🤗 Models
- SaSaSa2VA — a segmentation-augmented extension of Sa2VA that won 1st place in the ICCV 2025 LSVOS Challenge RVOS Track 🏅.
- Pixel-SAIL — single-transformer pixel-level grounding.
We manage dependencies with uv. Install it once:
curl -LsSf https://astral.sh/uv/install.sh | sh
The environment is defined under projects/sa2va (pyproject.toml + uv.lock) and shared across the projects. The quickest way to set it up is the helper script at the repo root, which places the virtualenv in /tmp and symlinks it back into the project:
bash setup_env.sh # projects/sa2va, --extra=latest
# bash setup_env.sh sa2va legacy # InternVL2.5 or earlier
Or do it manually:
cd projects/sa2va
uv sync --extra=latest # or --extra=legacy
Then run training / evaluation from the repository root with the environment activated (source projects/sa2va/.venv/bin/activate). See each project's README for project-specific steps.
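Before launching a long training or evaluation job, it can help to confirm that the interpreter really comes from the project environment and that the working directory is the repository root. The following is a minimal sketch using only the standard library; the helper names `in_virtualenv` and `looks_like_repo_root` are hypothetical and not part of this repository, and the root check is a heuristic based on the projects/sa2va/pyproject.toml location mentioned above.

```python
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def looks_like_repo_root(path: Path) -> bool:
    """Heuristic (an assumption): the root contains projects/sa2va/pyproject.toml."""
    return (path / "projects" / "sa2va" / "pyproject.toml").is_file()


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("at repository root:", looks_like_repo_root(Path.cwd()))
```

If either check prints False, activating the environment or changing into the repository root first is likely the fix.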
For tokens / API keys (HuggingFace, OpenRouter), copy the template and fill it in — setup_env.sh loads it automatically:
cp .env.example .env # then edit .env
Why uv? It treats the environment as code: dependencies are declared in pyproject.toml and every transitive package is version-locked in uv.lock. The result is a single source of truth that is reproducible across machines, easy to maintain, and recreated exactly with one uv sync, with no drift from manual pip install commands.
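For the .env file described above, scripts that run outside setup_env.sh may need to read the same values themselves. The sketch below shows one way to parse simple KEY=VALUE lines with the standard library. It is an illustration, not the loader that setup_env.sh uses; the function names are hypothetical, and the variable names in the usage example (`HF_TOKEN`, `OPENROUTER_API_KEY`) are assumptions, since the actual keys are whatever .env.example lists.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Tolerate shell-style "export KEY=VALUE" lines.
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Drop surrounding single or double quotes from the value.
        values[key] = value.strip().strip("\"'")
    return values


def load_env_file(path: Path) -> dict[str, str]:
    """Copy parsed values into os.environ without overriding existing variables."""
    values = parse_env_text(path.read_text()) if path.is_file() else {}
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


if __name__ == "__main__":
    loaded = load_env_file(Path(".env"))
    # Print only the names, never the secret values.
    print("loaded keys:", sorted(loaded))
```

Using `setdefault` means a value already exported in the shell takes precedence over the file, which keeps one-off overrides simple.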
If you find this repository useful, please consider citing the relevant papers:
@article{sa2va,
title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Sun, Yueyi and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
journal={arXiv pre-print},
year={2025}
}
@article{yuan2025vrt,
title={Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark},
author={Yuan, Haobo and Sun, Yueyi and Li, Yanwei and Zhang, Tao and Deng, Xueqing and Ding, Henghui and Qi, Lu and Wang, Anran and Li, Xiangtai and Yang, Ming-Hsuan},
journal={arXiv pre-print},
year={2025}
}
@inproceedings{zhou2026samtok,
title={SAMTok: Representing Any Mask with Two Words},
author={Zhou, Yikang and Zhang, Tao and Gong, Dengxian and Wu, Yuanzheng and Tian, Ye and Wang, Haochen and Yuan, Haobo and Wang, Jiacong and Qi, Lu and Fei, Hao and Wang, Anran and Wang, Zhuochen and Wang, Yujing and Chen, Cheng and Ji, Shunping and Li, Xiangtai},
booktitle={CVPR},
address={Denver, CO, USA},
year={2026}
}