
Releases: vllm-project/vllm

v0.12.0

03 Dec 09:36


vLLM v0.12.0 Release Notes

Highlights

This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals. Please review the changelog carefully.

Major Features:

  • EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
  • Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
  • AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

Model Support

  • New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
  • Format support: Gemma3 GGUF multimodal support (#27772).
  • Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
  • Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

  • GPU Model Runner V2 (Experimental) (#25266): A complete refactoring of the model execution pipeline:

    • Persistent batch removed: no "reordering" or complex bookkeeping
    • GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
    • Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
    • Simplified DP and CUDA graph implementations
    • Efficient structured outputs support
  • Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements existing Decode Context Parallel (DCP). See RFC #25749 for details.

  • RLHF Support: Pause and Resume Generation for Asynchronous RL Training (#28037).

  • KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).

  • Audio support: Audio embeddings support in chat completions (#29059).

  • Speculative Decoding:

    • Multi-step Eagle with CUDA graph (#29559)
    • EAGLE DP>1 support (#26086)
    • EAGLE3 heads without use_aux_hidden_states (#27688)
    • Eagle multimodal CUDA graphs with MRoPE (#28896)
    • Logprobs support with spec decode + async scheduling (#29223)
  • Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).

Hardware & Performance

  • NVIDIA Performance:

    • Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
    • Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
    • DeepGEMM N dim restriction reduced from 128 to 64 multiplier (#28687)
    • DeepEP low-latency with round-robin expert placement (#28449)
    • NVFP4 MoE CUTLASS support for SM120 (#29242)
    • H200 Fused MoE Config improvements (#28992)
  • AMD ROCm:

    • DeepSeek v3.2 and SparseMLA support (#26670)
    • FP8 MLA decode support (#28032)
    • AITER sampling ops integration (#26084)
    • AITER triton attention backend (#28701)
    • Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
    • Fastsafetensors support (#28225)
    • Sliding window support for AiterFlashAttentionBackend (#29234)
    • Whisper v1 with Aiter Unified/Flash Attention (#28376)
  • CPU:

    • Paged attention GEMM acceleration on ARM CPUs with NEON (#29193)
    • Parallelize over tokens in int4 MoE (#29600)
    • CPU all reduce optimization for async_scheduling + DP>1 (#29311)
  • Attention: FlashAttention support for ViT, now the default ViT backend (#28763).

  • Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).

  • Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).

  • Docker: Image size reduced by ~200MB (#29060).

Quantization

  • W4A8: Marlin kernel support (#24722).
  • NVFP4:
    • MoE CUTLASS support for SM120 (#29242)
    • TRTLLM MoE NVFP4 kernel (#28892)
    • CuteDSL MoE with NVFP4 DeepEP dispatch (#27141)
    • Non-gated activations support in modelopt path (#29004)
  • AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
  • LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
  • Online quantization: Moved to model.load_weights (#26327).

API & Frontend

  • Responses API:
    • Multi-turn support for non-harmony requests (#29175)
    • Reasoning item input parsing (#28248)
  • Tool Calling:
    • Parsed tool arguments support (#28820)
    • parallel_tool_calls param compliance (#26233)
    • Tool filtering support in ToolServer (#29224)
  • Whisper: verbose_json and timestamp features for transcription/translation (#24209).
  • Sampling: Flat logprob control moved from env var to SamplingParams (#28914).
  • GGUF: Improved HuggingFace loading UX with repo_id:quant_type syntax (#29137).
  • Profiling: Iteration-level profiling for Torch and CUDA profiler (#28987).
  • Logs: Colorized log output (#29017).
  • Optimization Levels: -O0, -O1, -O2, and -O3 allow trading startup time for performance; more compilation flags will be added in future releases (#26847). A usage sketch follows this list.
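
A minimal CLI sketch of the optimization levels mentioned above; the model name is illustrative, and the exact compilation behavior at each level may evolve across releases:

# Fastest startup, least compilation
vllm serve Qwen/Qwen3-0.6B -O0

# Maximum compilation for best steady-state performance
vllm serve Qwen/Qwen3-0.6B -O3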

Dependencies

  • PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
  • xgrammar: Updated to 0.1.27 (#28221).
  • Transformers: Updated to 4.57.3 (#29418), preparation for v5 with rope_parameters (#28542).
  • XPU: torch & IPEX 2.9 upgrade (#29307).

V0 Deprecation & Breaking Changes

Removed Parameters:

Deprecated:

Scheduled Removals (will be removed in a future release):

  • ParallelConfig's direct child EPLB fields (#29324)
  • guided_* config fields (#29326)
  • override_pooler_config and disable_log_requests (#29402)
  • CompilationConfig.use_inductor (#29323)
  • Deprecated metrics (#29330)

Other Breaking Changes:

  • PyTorch 2.9.0 upgrade requires CUDA 12.9 environment
  • Mistral format auto-detection for model loading (#28659)

New Contributors


v0.11.2

20 Nov 07:29


This release includes 4 bug fixes on top of v0.11.1:

  • [BugFix] Ray with multiple nodes (#28873)
  • [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (#29036)
  • [BugFix] Fix async-scheduling + FlashAttn MLA (#28990)
  • [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (#28938)

v0.11.1

18 Nov 23:03


Highlights

This release includes 1456 commits from 449 contributors (184 new contributors)!

Key changes include:

  • PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to torch==2.9.0+cu129, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
  • Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
  • Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect --async-scheduling to be enabled by default in the next release.
  • Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
  • Anthropic API Support: Added support for the /v1/messages endpoint, allowing users to interact with vllm serve using Anthropic-compatible clients. A request sketch follows this list.
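
A minimal request sketch against the new /v1/messages endpoint, assuming a local vllm serve instance on port 8000; the model name is illustrative, and authentication headers depend on how the server is configured:

curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Hello!"}]
      }'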

Detailed release notes will be updated in the next few days.

What's Changed


v0.11.0

02 Oct 19:17


Highlights

This release features 538 commits from 207 contributors (65 new)!

  • This release completes the removal of the V0 engine. V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all V0 attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
  • This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that support only PIECEWISE mode; a configuration sketch follows the note below.

Note: In v0.11.0 (and v0.10.2), --async-scheduling can produce gibberish output in some cases, such as preemption. This functionality is correct in v0.10.1. We are actively fixing it for the next version.
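
For models that support only PIECEWISE graphs, the CUDA graph mode can be pinned explicitly. A minimal sketch, assuming the cudagraph_mode field of --compilation-config accepts the mode names above (the model name is illustrative):

vllm serve Qwen/Qwen3-0.6B --compilation-config '{"cudagraph_mode": "PIECEWISE"}'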

Model Support

  • New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
  • Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
  • Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
  • Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
  • Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
  • Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
  • Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
  • Reasoning: SeedOSS reason parser (#24263).

Engine Core

  • KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
  • V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465).
  • Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
  • Async scheduling: Uniprocessor executor support (#24219).
  • Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
  • Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
  • Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
  • LoRA: Optimized weight loading (#25403).
  • Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
  • torch.compile: CUDA graph Inductor partition integration (#24281).

Hardware & Performance

  • NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
  • DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
  • New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
  • AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
  • Intel XPU: MoE DP accuracy fix (#25465).

Large Scale Serving & Performance

  • Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
  • Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
  • EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
  • Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
  • MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
  • Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).

Quantization

  • FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
  • FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
  • W4A8: Faster preprocessing (#23972).
  • Compressed tensors: Blocked FP8 for MoE (#25219).

API & Frontend

  • OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full-vocabulary logprobs (#25031; request sketch after this list), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health endpoint returning 503 on a dead engine (#24897).
  • Multimodal: Media UUID caching (#23950), image path format (#25081).
  • Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
  • CLI: --enable-logging (#25610), improved --help (#24903).
  • Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
  • Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
  • UX: Removed misleading quantization warning (#25012).
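
A hedged request sketch for the full-vocabulary logprobs option mentioned above, assuming a local server; the model name is illustrative, and returning logprobs for the entire vocabulary produces a very large response:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "logprobs": -1
      }'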

Security

Dependencies

  • PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
  • Build requirements: C++17 now enforced globally (#24823).
  • TPU: Deprecated xm.mark_step in favor of torch_xla.sync (#25254).

V0 Deprecation

What's Changed


v0.10.2

13 Sep 06:37


Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes the PyTorch 2.8.0 upgrade, V0 deprecations, and API changes. Please review the changelog carefully.

aarch64 support: This release features native support for aarch64, enabling vLLM on the GB200 platform. The Docker image vllm/vllm-openai should already be multiplatform. To install the wheels, download them from the release artifacts or install via:

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
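
For containerized use of the multiplatform image, a minimal sketch assuming the NVIDIA container runtime is installed; the model name is illustrative, and a Hugging Face cache volume can be mounted as needed:

docker run --runtime nvidia --gpus all -p 8000:8000 \
  vllm/vllm-openai:v0.10.2 \
  --model Qwen/Qwen3-0.6B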

Model Support

  • New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
  • Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
  • Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
  • LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
  • Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
  • Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

  • V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
  • Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
  • Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
  • Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
  • Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
  • Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
  • Distributed: Support Decode Context Parallel (DCP) for MLA (#23734)

Hardware & Performance

  • NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
  • Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
  • Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
  • Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
  • Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
  • Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

  • New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
  • Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
  • FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
  • Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
  • Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
  • Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

  • OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
  • Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
  • Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
  • Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

  • Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
  • Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
  • Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

  • Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
  • API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

  1. PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
  2. FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
  3. V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
  4. Quantization removals - Removed the quantized Mixtral hack implementation and the original Marlin format.
  5. Metrics renaming - TPOT deprecated in favor of ITL

What's Changed


v0.10.1.1

20 Aug 21:20


This is a critical bugfix and security release.

Full Changelog: v0.10.1...v0.10.1.1

v0.10.1

18 Aug 04:39


Highlights

The v0.10.1 release includes 727 commits from 245 contributors (105 new).

NOTE: This release deprecates V0 FlashAttention 3 (FA3) support; as a result, FP8 KV cache in V0 may have issues.

Model Support

  • New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
  • Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
  • Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249)
  • Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
  • Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
  • V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).

Engine Core

  • CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#22763).
  • Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
  • Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
  • Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
  • Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
  • Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).

Hardware & Performance

  • NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
  • NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
  • AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
  • Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
  • Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
  • Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
  • Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized reshape_and_cache_flash CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
  • Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).

Quantization

  • Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
  • Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
  • Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
  • Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
  • Breaking change: Removed AQLM quantization support (#22943) - users should migrate to alternative quantization methods.

API & Frontend

  • OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format matching upstream specification (#22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#21052).
  • New API capabilities: Dedicated LLM.reward interface for reward models (#21720), chunked processing for long inputs in embedding models (#22280), AsyncLLM proper response handling for aborted requests (#22283).
  • Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
  • CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
  • Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
  • Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
  • User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538).

Dependencies

  • FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to optional dependency install with pip install vllm[flashinfer] for flexible installation (#21959).
  • Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
  • Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
  • Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
  • Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).

V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

  • CLI flag updates: Replaced --task with --runner and --convert options (#21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#20544).
  • API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#21907).

What's Changed

  • Deduplicate Transformers backend code using inheritance by @hmellor in #21461
  • [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
  • [TPU][Bugfix] fix moe layer by @yaochengji in #21340
  • [v1][Core] Clean up usages of SpecializedManager by @zhouwfang in #21407
  • [Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
  • [Core] Support model loader plugins by @22quinn in #21067
  • remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
  • Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none ...

v0.10.1rc1

17 Aug 22:57


Pre-release

What's Changed


v0.10.0

24 Jul 22:43


Highlights

The v0.10.0 release includes 308 commits from 168 contributors (62 new!).

NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long-context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and we plan to continue deleting code that is no longer used.

Model Support

  • New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
  • Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
  • Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
  • VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).

Engine Core

  • Experimental async scheduling: the --async-scheduling flag overlaps engine-core scheduling with the GPU runner (#19970); a usage sketch follows this list.
  • V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
  • Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
  • RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
  • Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
  • Startup time reduction via CUDA graph capture speedup via frozen GC (#21146).
  • Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
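
A minimal sketch of enabling the experimental flag; the model name is illustrative, and behavior may change while the feature remains experimental:

vllm serve Qwen/Qwen3-0.6B --async-scheduling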

Hardware & Performance

  • NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
  • Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
  • Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

  • New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
  • Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
  • Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).

API & Frontend

  • OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).
  • New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
  • Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
  • CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).

Dependencies

  • Updated PyTorch to 2.7.1 for CUDA (#21011)
  • FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed


v0.10.0rc2

24 Jul 05:04


Pre-release

What's Changed
