# Recommended Model and Feature Matrices
Although vLLM TPU's new unified backend enables out-of-the-box, high-performance serving for any model supported in vLLM, a few core components are still being implemented. Until those capabilities land, we recommend starting from the stress-tested models and features listed below.
We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).
If you'd like us to prioritize something specific, please open a feature request on GitHub.
## Recommended Models
These tables show the models currently tested for accuracy and performance.
### Text-Only Models
| Model | Unit Test | Integration Test | Benchmark |
|---|---|---|---|
| meta-llama/Llama-3.3-70B-Instruct | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | ✅ | ✅ | ✅ |
| google/gemma-3-27b-it | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | ✅ | ✅ | ✅ |
| meta-llama/Llama-Guard-4-12B | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | ✅ | ✅ | ✅ |
### Multimodal Models
| Model | Unit Test | Integration Test | Benchmark |
|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | ✅ | ✅ | ✅ |
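Any model in the tables above can be launched with the standard vLLM entry points. A minimal sketch (the parallelism size and context length below are placeholder values; tune them to your TPU slice and workload):

```shell
# Serve one of the stress-tested models with the OpenAI-compatible server.
# --tensor-parallel-size and --max-model-len are illustrative values,
# not recommendations; match them to your TPU topology.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 4096
```

Once up, the server exposes the usual OpenAI-compatible API on port 8000.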
## Recommended Features
This table shows the features currently tested for accuracy and performance.
| Feature | Correctness Test | Performance Test |
|---|---|---|
| Chunked Prefill | ✅ | ✅ |
| DCN-based prefill/decode (P/D) disaggregation | to be added | to be added |
| KV cache host offloading | to be added | to be added |
| Llama 4 Maverick | to be added | to be added |
| LoRA_Torch | ✅ | to be added |
| Multimodal Inputs | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ |
| Async Scheduler | ✅ | ✅ |
| runai_model_streamer_loader | ✅ | N/A |
| sampling_params | ✅ | N/A |
| structured_decoding | ✅ | N/A |
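Several of the rows above name algorithmic features rather than flags. For instance, `Speculative Decoding: Ngram` drafts candidate tokens by matching the tail of the generated sequence against earlier context. A minimal, framework-free sketch of that proposal idea (an illustration of the technique only, not vLLM's implementation; the function name is hypothetical):

```python
def ngram_propose(tokens, n=2, k=3):
    """Toy n-gram draft proposer (illustrative, not vLLM's code):
    find the most recent earlier occurrence of the last n tokens and
    propose the up-to-k tokens that followed it. Returns [] on no match."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan earlier positions right-to-left so the most recent match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            follow = tokens[start + n:start + n + k]
            if follow:
                return follow
    return []
```

In speculative decoding, proposed tokens are then verified by the target model in a single forward pass, and the longest accepted prefix is kept.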
## Kernel Support
This table shows the current kernel support status.
| Feature | Correctness Test | Performance Test |
|---|---|---|
| Collective Communication Matmul | ✅ | to be added |
| MLA | to be added | to be added |
| MoE | to be added | to be added |
| Quantized Attention | to be added | to be added |
| Quantized KV Cache | to be added | to be added |
| Quantized Matmul | to be added | to be added |
| Ragged Paged Attention V3 | ✅ | ✅ |
## Parallelism Support
This table shows the current parallelism support status.
| Feature | Correctness Test | Performance Test |
|---|---|---|
| CP (context parallelism) | to be added | to be added |
| DP (data parallelism) | ❌ | N/A |
| EP (expert parallelism) | to be added | to be added |
| PP (pipeline parallelism) | ✅ | ✅ |
| SP (sequence parallelism) | to be added | to be added |
| TP (tensor parallelism) | to be added | to be added |
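Of the strategies above, pipeline parallelism is currently marked as tested on both axes. It is enabled with the standard vLLM engine flag; a sketch (the model choice and size are placeholders for your slice shape):

```shell
# Pipeline parallelism across hosts; the size here is illustrative.
vllm serve Qwen/Qwen3-32B \
    --pipeline-parallel-size 2
```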
## Quantization Support
This table shows the current quantization support status.
| Feature | Recommended TPU Generations | Correctness Test | Performance Test |
|---|---|---|---|
| AWQ INT4 | v5, v6 | to be added | to be added |
| FP4 W4A16 | v7 | to be added | to be added |
| FP8 W8A8 | v7 | to be added | to be added |
| FP8 W8A16 | v7 | to be added | to be added |
| INT4 W4A16 | v5, v6 | to be added | to be added |
| INT8 W8A8 | v5, v6 | to be added | to be added |