
Releases: vllm-project/vllm

v0.12.0

03 Dec 09:36


vLLM v0.12.0 Release Notes

Highlights

This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals. Please review the changelog carefully.

Major Features:

  • EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
  • Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
  • AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

Model Support

  • New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
  • Format support: Gemma3 GGUF multimodal support (#27772).
  • Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
  • Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

  • GPU Model Runner V2 (Experimental) (#25266): A complete refactoring of the model execution pipeline:

    • Persistent batch removed: no "reordering" or complex bookkeeping
    • GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
    • Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
    • Simplified DP and CUDA graph implementations
    • Efficient structured outputs support
  • Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements existing Decode Context Parallel (DCP). See RFC #25749 for details.

  • RLHF Support: Pause and Resume Generation for Asynchronous RL Training (#28037).

  • KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).

  • Audio support: Audio embeddings support in chat completions (#29059).

  • Speculative Decoding:

    • Multi-step Eagle with CUDA graph (#29559)
    • EAGLE DP>1 support (#26086)
    • EAGLE3 heads without use_aux_hidden_states (#27688)
    • Eagle multimodal CUDA graphs with MRoPE (#28896)
    • Logprobs support with spec decode + async scheduling (#29223)
  • Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).

Hardware & Performance

  • NVIDIA Performance:

    • Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
    • Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
    • DeepGEMM N dim restriction reduced from 128 to 64 multiplier (#28687)
    • DeepEP low-latency with round-robin expert placement (#28449)
    • NVFP4 MoE CUTLASS support for SM120 (#29242)
    • H200 Fused MoE Config improvements (#28992)
  • AMD ROCm:

    • DeepSeek v3.2 and SparseMLA support (#26670)
    • FP8 MLA decode support (#28032)
    • AITER sampling ops integration (#26084)
    • AITER triton attention backend (#28701)
    • Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
    • Fastsafetensors support (#28225)
    • Sliding window support for AiterFlashAttentionBackend (#29234)
    • Whisper v1 with Aiter Unified/Flash Attention (#28376)
  • CPU:

    • Paged attention GEMM acceleration on ARM CPUs with NEON (#29193)
    • Parallelize over tokens in int4 MoE (#29600)
    • CPU all reduce optimization for async_scheduling + DP>1 (#29311)
  • Attention: FlashAttention support for ViT, now the default ViT backend (#28763).

  • Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).

  • Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).

  • Docker: Image size reduced by ~200MB (#29060).

Quantization

  • W4A8: Marlin kernel support (#24722).
  • NVFP4:
    • MoE CUTLASS support for SM120 (#29242)
    • TRTLLM MoE NVFP4 kernel (#28892)
    • CuteDSL MoE with NVFP4 DeepEP dispatch (#27141)
    • Non-gated activations support in modelopt path (#29004)
  • AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
  • LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
  • Online quantization: Moved to model.load_weights (#26327).

API & Frontend

  • Responses API:
    • Multi-turn support for non-harmony requests (#29175)
    • Reasoning item input parsing (#28248)
  • Tool Calling:
    • Parsed tool arguments support (#28820)
    • parallel_tool_calls param compliance (#26233)
    • Tool filtering support in ToolServer (#29224)
  • Whisper: verbose_json and timestamp features for transcription/translation (#24209).
  • Sampling: Flat logprob control moved from env var to SamplingParams (#28914).
  • GGUF: Improved HuggingFace loading UX with repo_id:quant_type syntax (#29137).
  • Profiling: Iteration-level profiling for Torch and CUDA profiler (#28987).
  • Logs: Colorized log output (#29017).
  • Optimization Levels: -O0, -O1, -O2, and -O3 allow trading startup time for performance; more compilation flags will be added in future releases (#26847). A usage sketch follows this list.
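
A minimal CLI sketch of the optimization levels mentioned above; the model name is illustrative, and the exact compilation behavior at each level may evolve across releases:

# Fastest startup, least compilation
vllm serve Qwen/Qwen3-0.6B -O0

# Maximum compilation for best steady-state performance
vllm serve Qwen/Qwen3-0.6B -O3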

Dependencies

  • PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
  • xgrammar: Updated to 0.1.27 (#28221).
  • Transformers: Updated to 4.57.3 (#29418), preparation for v5 with rope_parameters (#28542).
  • XPU: torch & IPEX 2.9 upgrade (#29307).

V0 Deprecation & Breaking Changes

Removed Parameters:

Deprecated:

Scheduled Removals (will be removed in a future release):

  • ParallelConfig's direct child EPLB fields (#29324)
  • guided_* config fields (#29326)
  • override_pooler_config and disable_log_requests (#29402)
  • CompilationConfig.use_inductor (#29323)
  • Deprecated metrics (#29330)

Other Breaking Changes:

  • PyTorch 2.9.0 upgrade requires CUDA 12.9 environment
  • Mistral format auto-detection for model loading (#28659)

New Contributors


v0.11.2

20 Nov 07:29


This release includes 4 bug fixes on top of v0.11.1:

  • [BugFix] Ray with multiple nodes (#28873)
  • [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (#29036)
  • [BugFix] Fix async-scheduling + FlashAttn MLA (#28990)
  • [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (#28938)

v0.11.1

18 Nov 23:03


Highlights

This release includes 1456 commits from 449 contributors (184 new contributors)!

Key changes include:

  • PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to torch==2.9.0+cu129, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
  • Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
  • Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect --async-scheduling to be enabled by default in the next release.
  • Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
  • Anthropic API Support: Added support for the /v1/messages endpoint, allowing users to interact with vllm serve using Anthropic-compatible clients. A request sketch follows this list.
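
A minimal request sketch against the new /v1/messages endpoint, assuming a local vllm serve instance on port 8000; the model name is illustrative, and authentication headers depend on how the server is configured:

curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Hello!"}]
      }'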

Detailed release notes will be updated in the next few days.

What's Changed


v0.11.0

02 Oct 19:17


Highlights

This release features 538 commits from 207 contributors (65 new)!

  • This release completes the removal of the V0 engine. V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all V0 attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
  • This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that support only PIECEWISE mode; a configuration sketch follows the note below.

Note: In v0.11.0 (and v0.10.2), --async-scheduling can produce gibberish output in some cases, such as preemption. This functionality is correct in v0.10.1. We are actively fixing it for the next version.
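
For models that support only PIECEWISE graphs, the CUDA graph mode can be pinned explicitly. A minimal sketch, assuming the cudagraph_mode field of --compilation-config accepts the mode names above (the model name is illustrative):

vllm serve Qwen/Qwen3-0.6B --compilation-config '{"cudagraph_mode": "PIECEWISE"}'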

Model Support

  • New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
  • Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
  • Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
  • Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
  • Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
  • Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
  • Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
  • Reasoning: SeedOSS reason parser (#24263).

Engine Core

  • KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
  • V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465).
  • Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
  • Async scheduling: Uniprocessor executor support (#24219).
  • Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
  • Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
  • Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
  • LoRA: Optimized weight loading (#25403).
  • Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
  • torch.compile: CUDA graph Inductor partition integration (#24281).

Hardware & Performance

  • NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
  • DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
  • New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
  • AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
  • Intel XPU: MoE DP accuracy fix (#25465).

Large Scale Serving & Performance

  • Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
  • Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
  • EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
  • Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
  • MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
  • Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).

Quantization

  • FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
  • FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
  • W4A8: Faster preprocessing (#23972).
  • Compressed tensors: Blocked FP8 for MoE (#25219).

API & Frontend

  • OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full-vocabulary logprobs (#25031; request sketch after this list), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health endpoint returning 503 on a dead engine (#24897).
  • Multimodal: Media UUID caching (#23950), image path format (#25081).
  • Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
  • CLI: --enable-logging (#25610), improved --help (#24903).
  • Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
  • Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
  • UX: Removed misleading quantization warning (#25012).
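
A hedged request sketch for the full-vocabulary logprobs option mentioned above, assuming a local server; the model name is illustrative, and returning logprobs for the entire vocabulary produces a very large response:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "logprobs": -1
      }'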

Security

Dependencies

  • PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
  • Build requirements: C++17 now enforced globally (#24823).
  • TPU: Deprecated xm.mark_step in favor of torch_xla.sync (#25254).

V0 Deprecation

What's Changed


v0.10.2

13 Sep 06:37


Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes the PyTorch 2.8.0 upgrade, V0 deprecations, and API changes. Please review the changelog carefully.

aarch64 support: This release features native support for aarch64, enabling vLLM on the GB200 platform. The Docker image vllm/vllm-openai should already be multiplatform. To install the wheels, download them from the release artifacts or install via:

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
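
For containerized use of the multiplatform image, a minimal sketch assuming the NVIDIA container runtime is installed; the model name is illustrative, and a Hugging Face cache volume can be mounted as needed:

docker run --runtime nvidia --gpus all -p 8000:8000 \
  vllm/vllm-openai:v0.10.2 \
  --model Qwen/Qwen3-0.6B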

Model Support

  • New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
  • Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
  • Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
  • LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
  • Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
  • Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

  • V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
  • Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
  • Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
  • Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
  • Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
  • Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
  • Distributed: Support Decode Context Parallel (DCP) for MLA (#23734)

Hardware & Performance

  • NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
  • Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
  • Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
  • Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
  • Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
  • Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

  • New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
  • Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
  • FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
  • Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
  • Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
  • Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

  • OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
  • Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
  • Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
  • Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

  • Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
  • Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
  • Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

  • Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
  • API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

  1. PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
  2. FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
  3. V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
  4. Quantization removals - Removed the quantized Mixtral hack implementation and the original Marlin format.
  5. Metrics renaming - TPOT deprecated in favor of ITL

What's Changed


v0.10.1.1

20 Aug 21:20


This is a critical bugfix and security release.

Full Changelog: v0.10.1...v0.10.1.1

v0.10.1

18 Aug 04:39


Highlights

The v0.10.1 release includes 727 commits from 245 contributors (105 new).

NOTE: This release deprecates V0 FlashAttention 3 (FA3) support; as a result, FP8 KV cache in V0 may have issues.

Model Support

  • New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
  • Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
  • Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249)
  • Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
  • Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
  • V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).

Engine Core

  • CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#22763).
  • Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
  • Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
  • Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
  • Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
  • Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).

Hardware & Performance

  • NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
  • NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
  • AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
  • Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
  • Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
  • Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
  • Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized reshape_and_cache_flash CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
  • Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).

Quantization

  • Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
  • Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
  • Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
  • Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
  • Breaking change: Removed AQLM quantization support (#22943) - users should migrate to alternative quantization methods.

API & Frontend

  • OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format matching upstream specification (#22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#21052).
  • New API capabilities: Dedicated LLM.reward interface for reward models (#21720), chunked processing for long inputs in embedding models (#22280), AsyncLLM proper response handling for aborted requests (#22283).
  • Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
  • CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
  • Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
  • Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
  • User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538).

Dependencies

  • FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to optional dependency install with pip install vllm[flashinfer] for flexible installation (#21959).
  • Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
  • Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
  • Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
  • Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).

V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

  • CLI flag updates: Replaced --task with --runner and --convert options (#21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#20544).
  • API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#21907).

What's Changed

  • Deduplicate Transformers backend code using inheritance by @hmellor in #21461
  • [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
  • [TPU][Bugfix] fix moe layer by @yaochengji in #21340
  • [v1][Core] Clean up usages of SpecializedManager by @zhouwfang in #21407
  • [Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
  • [Core] Support model loader plugins by @22quinn in #21067
  • remove GLM-4 quantization wrong Code by @zRzRzRzRzRzRzR in #21435
  • Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none ...

v0.10.1rc1

17 Aug 22:57


Pre-release

What's Changed


v0.10.0

24 Jul 22:43


Highlights

The v0.10.0 release includes 308 commits from 168 contributors (62 new!).

NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long-context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and we plan to continue deleting code that is no longer used.

Model Support

  • New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
  • Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
  • Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
  • VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).

Engine Core

  • Experimental async scheduling: the --async-scheduling flag overlaps engine-core scheduling with the GPU runner (#19970); a usage sketch follows this list.
  • V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
  • Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
  • RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
  • Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).
  • Startup time reduction via CUDA graph capture speedup via frozen GC (#21146).
  • Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).
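
A minimal sketch of enabling the experimental flag; the model name is illustrative, and behavior may change while the feature remains experimental:

vllm serve Qwen/Qwen3-0.6B --async-scheduling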

Hardware & Performance

  • NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
  • Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
  • Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

  • New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).
  • Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
  • Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).

API & Frontend

  • OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).
  • New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
  • Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
  • CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).

Dependencies

  • Updated PyTorch to 2.7.1 for CUDA (#21011)
  • FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed


v0.10.0rc2

24 Jul 05:04


Pre-release

What's Changed
