# Recommended Model and Feature Matrices
Although vLLM TPU's new unified backend enables out-of-the-box, high-performance serving for any model supported in vLLM, a few core components are still being implemented. Until those capabilities land, we recommend starting from the stress-tested models and features listed below.
We are still landing components in tpu-inference that will improve performance for larger-scale, higher-complexity models (XL MoE, vision encoders, MLA, etc.).
If you'd like us to prioritize something specific, please open a feature request on GitHub.
## Recommended Models
These tables show the models currently tested for accuracy and performance.
### Text-Only Models
| Model | Unit Test | Integration Test | Benchmark |
|---|---|---|---|
| meta-llama/Llama-3.3-70B-Instruct | ✅ | ✅ | ✅ |
| Qwen/Qwen3-4B | ✅ | ✅ | ✅ |
| google/gemma-3-27b-it | ✅ | ✅ | ✅ |
| Qwen/Qwen3-32B | ✅ | ✅ | ✅ |
| meta-llama/Llama-Guard-4-12B | ✅ | ✅ | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | ✅ | ✅ | ✅ |
| Qwen/Qwen3-30B-A3B | ✅ | ✅ | ✅ |
### Multimodal Models
| Model | Unit Test | Integration Test | Benchmark |
|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | ✅ | ✅ | ✅ |
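Any model in the tables above can be launched with the standard vLLM entry points. A minimal sketch (the parallelism size and context length below are placeholder values; tune them to your TPU slice and workload):

```shell
# Serve one of the stress-tested models with the OpenAI-compatible server.
# --tensor-parallel-size and --max-model-len are illustrative values,
# not recommendations; match them to your TPU topology.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 4096
```

Once up, the server exposes the usual OpenAI-compatible API on port 8000.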
## Recommended Features
This table shows the features currently tested for accuracy and performance.
| Feature | Correctness Test | Performance Test |
|---|---|---|
| Chunked Prefill | ✅ | ✅ |
| DCN-based prefill/decode (P/D) disaggregation | to be added | to be added |
| KV cache host offloading | to be added | to be added |
| Llama 4 Maverick | to be added | to be added |
| LoRA_Torch | ✅ | to be added |
| Multimodal Inputs | ✅ | ✅ |
| Out-of-tree model support | ✅ | ✅ |
| Prefix Caching | ✅ | ✅ |
| Single Program Multi Data | ✅ | ✅ |
| Speculative Decoding: Eagle3 | ✅ | ✅ |
| Speculative Decoding: Ngram | ✅ | ✅ |
| Async Scheduler | ✅ | ✅ |
| runai_model_streamer_loader | ✅ | N/A |
| sampling_params | ✅ | N/A |
| structured_decoding | ✅ | N/A |
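Several of the rows above name algorithmic features rather than flags. For instance, `Speculative Decoding: Ngram` drafts candidate tokens by matching the tail of the generated sequence against earlier context. A minimal, framework-free sketch of that proposal idea (an illustration of the technique only, not vLLM's implementation; the function name is hypothetical):

```python
def ngram_propose(tokens, n=2, k=3):
    """Toy n-gram draft proposer (illustrative, not vLLM's code):
    find the most recent earlier occurrence of the last n tokens and
    propose the up-to-k tokens that followed it. Returns [] on no match."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan earlier positions right-to-left so the most recent match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            follow = tokens[start + n:start + n + k]
            if follow:
                return follow
    return []
```

In speculative decoding, proposed tokens are then verified by the target model in a single forward pass, and the longest accepted prefix is kept.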
## Kernel Support
This table shows the current kernel support status.
| Feature | Correctness Test | Performance Test |
|---|---|---|
| Collective Communication Matmul | ✅ | to be added |
| MLA | to be added | to be added |
| MoE | to be added | to be added |
| Quantized Attention | to be added | to be added |
| Quantized KV Cache | to be added | to be added |
| Quantized Matmul | to be added | to be added |
| Ragged Paged Attention V3 | ✅ | ✅ |
## Parallelism Support
This table shows the current parallelism support status.
| Feature | Correctness Test | Performance Test |
|---|---|---|
| CP (context parallelism) | to be added | to be added |
| DP (data parallelism) | ❌ | N/A |
| EP (expert parallelism) | to be added | to be added |
| PP (pipeline parallelism) | ✅ | ✅ |
| SP (sequence parallelism) | to be added | to be added |
| TP (tensor parallelism) | to be added | to be added |
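Of the strategies above, pipeline parallelism is currently marked as tested on both axes. It is enabled with the standard vLLM engine flag; a sketch (the model choice and size are placeholders for your slice shape):

```shell
# Pipeline parallelism across hosts; the size here is illustrative.
vllm serve Qwen/Qwen3-32B \
    --pipeline-parallel-size 2
```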
## Quantization Support
This table shows the current quantization support status.
| Feature | Recommended TPU Generations | Correctness Test | Performance Test |
|---|---|---|---|
| AWQ INT4 | v5, v6 | to be added | to be added |
| FP4 W4A16 | v7 | to be added | to be added |
| FP8 W8A8 | v7 | to be added | to be added |
| FP8 W8A16 | v7 | to be added | to be added |
| INT4 W4A16 | v5, v6 | to be added | to be added |
| INT8 W8A8 | v5, v6 | to be added | to be added |