
# Recommended Model and Feature Matrices

Although vLLM TPU's new unified backend enables out-of-the-box, high-performance serving for any model supported in vLLM, we are still implementing a few core components. Until those capabilities land, we recommend starting from the list of stress-tested models and features below.

We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).

If you'd like us to prioritize something specific, please submit a feature request on GitHub.

These tables show the models currently tested for accuracy and performance.

## Text-Only Models

| Model | Unit Test | Integration Test | Benchmark |
|-------|-----------|------------------|-----------|
| meta-llama/Llama-3.3-70B-Instruct | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | ✅ | ✅ | ✅ |
| google/gemma-3-27b-it | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | ✅ | ✅ | ✅ |
| meta-llama/Llama-Guard-4-12B | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | ✅ | ✅ | ✅ |
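Any of the models above can be launched with the standard vLLM entry point. A minimal sketch (the model choice, context length, and parallelism degree are illustrative; set `--tensor-parallel-size` to match the number of TPU chips in your topology):

```shell
# Serve a recommended text-only model with vLLM's OpenAI-compatible server.
# --tensor-parallel-size 8 is illustrative; match it to your available TPU chips.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 4096
```

Once the server is up, it accepts requests on the standard OpenAI-compatible endpoints (default port 8000).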

## Multimodal Models

| Model | Unit Test | Integration Test | Benchmark |
|-------|-----------|------------------|-----------|
| Qwen/Qwen2.5-VL-7B-Instruct | ✅ | ✅ | ✅ |

This table shows the features currently tested for accuracy and performance.

| Feature | Correctness Test | Performance Test |
|---------|------------------|------------------|
| Chunked Prefill | ✅ | ✅ |
| DCN-based P/D disaggregation | to be added | to be added |
| KV cache host offloading | to be added | to be added |
| Llama 4 Maverick | to be added | to be added |
| LoRA_Torch | ✅ | to be added |
| Multimodal Inputs | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ |
| Async scheduler | ✅ | ✅ |
| runai_model_streamer_loader | ✅ | N/A |
| sampling_params | ✅ | N/A |
| structured_decoding | ✅ | N/A |
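Several of the features above (for example `sampling_params`) are exercised per-request rather than at serve time. A hedged sketch of querying a running server through the OpenAI-compatible completions endpoint, assuming the default port 8000 and a served Qwen/Qwen3-4B model:

```shell
# Exercise sampling_params (temperature, top_p, max_tokens) on a running
# vLLM server via the OpenAI-compatible /v1/completions endpoint.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-4B",
        "prompt": "The capital of France is",
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 32
      }'
```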

## Kernel Support

This table shows the current kernel support status.

| Feature | Correctness Test | Performance Test |
|---------|------------------|------------------|
| Collective Communication Matmul | ✅ | to be added |
| MLA | to be added | to be added |
| MoE | to be added | to be added |
| Quantized Attention | to be added | to be added |
| Quantized KV Cache | to be added | to be added |
| Quantized Matmul | to be added | to be added |
| Ragged Paged Attention V3 | ✅ | ✅ |

## Parallelism Support

This table shows the current parallelism support status.

| Feature | Correctness Test | Performance Test |
|---------|------------------|------------------|
| CP (Context Parallelism) | to be added | to be added |
| DP (Data Parallelism) | ✅ | N/A |
| EP (Expert Parallelism) | to be added | to be added |
| PP (Pipeline Parallelism) | ✅ | ✅ |
| SP (Sequence Parallelism) | to be added | to be added |
| TP (Tensor Parallelism) | to be added | to be added |

## Quantization Support

This table shows the current quantization support status.

| Feature | Recommended TPU Generations | Correctness Test | Performance Test |
|---------|-----------------------------|------------------|------------------|
| AWQ INT4 | v5, v6 | to be added | to be added |
| FP4 W4A16 | v7 | to be added | to be added |
| FP8 W8A8 | v7 | to be added | to be added |
| FP8 W8A16 | v7 | to be added | to be added |
| INT4 W4A16 | v5, v6 | to be added | to be added |
| INT8 W8A8 | v5, v6 | to be added | to be added |
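For pre-quantized checkpoints, the quantization scheme is typically picked up from the model's config, or selected explicitly with vLLM's `--quantization` flag. A minimal sketch, assuming an AWQ INT4 checkpoint (the model name below is illustrative; substitute any AWQ-quantized checkpoint, and note the table above recommends v5/v6 TPUs for AWQ INT4):

```shell
# Serve a pre-quantized AWQ INT4 checkpoint, selecting the scheme explicitly.
# Model name and context length are illustrative.
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq \
    --max-model-len 4096
```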