Releases: NVIDIA/TensorRT-LLM
v1.3.0rc5
Highlights

Model Support

API

Feature
- Add cache transfer setup for Mamba states (#10934)
- Optimize MoE export by tracing with reduced experts and expanding graph (#11504)
- Add new Helix kernels for MNNVL-based codepath (#11433)
- Add `line_profiler` tool for host overhead analysis (#11232)
- Enable multi-stream MoE; add multi-stream MLA attention (#11520)
- Add MoE all-to-all paradigm (#10985)
- Add support for multiple instances in the Triton backend with the PyTorch backend (#11153)
- Add KV cache metrics to `MetricsCollector` for more Prometheus metrics (#11243)
- Account for reusable KV cache blocks in capacity calculation (#11490)
- Add CUDA graphs, torch compile, NVTX, and warmup for Visual Gen (#11554)
- Make preprocessing async (#11459)
- Split up `TorchSampler.Store` (#11566)
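One item in this list (#11243) adds KV cache metrics to `MetricsCollector` for Prometheus. As a hedged illustration of what such collectors emit, here is a minimal sketch of the Prometheus text exposition format; the metric and label names below are hypothetical, not the actual names TensorRT-LLM exports.

```python
# Hypothetical sketch: render a KV-cache gauge in Prometheus text
# exposition format. Metric and label names are illustrative only,
# not the actual names exported by TensorRT-LLM's MetricsCollector.
def render_metric(name: str, value: float, labels: dict) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric("kv_cache_free_blocks", 1024, {"model": "demo"})
print(line)  # kv_cache_free_blocks{model="demo"} 1024
```

A scraper pointed at the server's metrics endpoint would ingest lines in exactly this shape.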
Fix
- Fix multimodal placeholder counts (#11461)
- Add `cacheSaltID` property to `BlockKey` serialization (#11457)
- Fix cache transceiver (#11409)
- Declare the variable in the correct scope (#11066)
- Fix spec-dec mode flag and related C++ requirements (#10996)
- Fix Qwen3-VL-Dense/MoE accuracy drop (#11134)
- Complete WAR for `popen` in QA env (#11214)
- Improve error message for mismatched MPI world size (#11294)
- Use the `torch_dtype` set by ModelOpt (#11525)
- Fix silent MPI failures on models with custom tokenizers (#11399)
- Fix Nemotron issues (#11425)
- Fix pipeline parallelism + disaggregated serving (#11509)
- Fix broken LLMAPI config (#11571)
- Fix illegal memory access with Helix CP=64 (#11593)
- Validate requests outside sampling loop (#11584)
- Correct chunked prefill handling in `TorchSampler` (#11544)
- Fix SpecDec sampling seed (#11081)
- Prevent NIXL agent name collision in containerized disaggregated serving (#11552)
Documentation
- Add doc for TRTLLM AIGV initial release (#11489)
- Update hardware support (#10719)
- Add documentation on configuring CPU affinity in TRT-LLM (#10678)
- Add warning about 2-model MTP deprecation (#11043)
- Update media file paths in Skip Softmax blog (#11540)
- Update TAVA architecture diagrams for visual gen flow and auto deploy flow (#11523)
- Add Qwen3.5 and GLM 4.7 Flash to support matrix (#11594)
Benchmark
- Add ctx-only and gen-only disaggregated perf tests (#11361)
Test & Infra
- Add CUTEDSL MoE backend for DeepSeek R1 NVFP4 checkpoint in stress test (#10920)
- Update MIG tests (#11014)
- Fix Slurm job name (#11265)
- Ensure `TorchSampler` does not sync (#11508)
- Revert the MoE unit tests refactor that added the unified ConfigurableMoE test framework (#11532)
- Re-upgrade GHA for blossom-ci workflow (#11483)
- Stop using remotes in the Conan install build step (#11516)
- Update PLC pipeline (#11547, #11597)
- Fix testdb file for `l0_b200_multi_gpus_perf_sanity` (#11603)
- Add `visual_gen` CODEOWNERS paths (#11606)
What's Changed
- [None][chore] Adjust waive to avoid sm parsing by @tburt-nv in #11518
- [None][chore] Optimize MOE export by tracing with reduced experts and expanding graph by @suyoggupta in #11504
- [#11170][fix] Fix for mm placeholder counts by @2ez4bz in #11461
- [None][feat] Add new helix kernels for MNNVL-based codepath by @brb-nv in #11433
- [TRTLLM-11016][fix] Add cacheSaltID property to BlockKey serialization code by @thorjohnsen in #11457
- [https://nvbugs/5880261][fix] fix cacheTransceiver by @chuangz0 in #11409
- [None][doc] Add doc for TRTLLM AIGV initial release by @chang-l in #11489
- [TRTLLM-10851][feat] Add line_profiler tool for host overhead analysis. by @hyukn in #11232
- [None][chore] Mass integration of release/1.2 - 4th by @dominicshanshan in #11500
- [None][feat] Use new index api, add block scale support, fix max_seq_len esitmation, add flash mla support by @yizhang-nv in #11334
- [#11455][bug] Use the torch_dtype set by ModelOpt by @tcherckez-nvidia in #11525
- [#10345][perf] Enable multi-stream MOE for super. Also adds multi-stream MLA attn by @suyoggupta in #11520
- [TRTLLM-10030][test] ensure that TorchSampler does not sync by @ixlmar in #11508
- [None][revert] - Revert "[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework" by @chzblych in #11532
- [None][fix] Better error message for mismatched MPI world size by @jthomson04 in #11294
- [#11109][feat] AutoDeploy: GLM 4.7 Flash Improvements by @bmarimuthu-nv in #11414
- [None][doc] Update media files path in Skip Softmax blog. by @bobboli in #11540
- [#11318][infra] AutoDeploy: Add fused rope kernel - triton_rope_on_interleaved_qk_inputs by @bmarimuthu-nv in #11327
- [None][chore] Waive failing pre-merge test by @brb-nv in #11551
- [None][chore] Waive moe fp4 test by @brb-nv in #11558
- [None][chore] Bump version to 1.3.0rc5 by @yuanjingx87 in #11557
- [TRTLLM-10845][feat] Add dynamic llmapi defaults system by @venkywonka in #11035
- [https://nvbugs/5888464][fix] Stop using remotes in the Conan install build step by @tburt-nv in #11516
- [None][chore] TAVA architecture diagram updates for visual gen flow and auto deploy flow by @yibinl-nvidia in #11523
- [TRTLLM-10064][feat] MoE all-to-all paradigm by @greg-kwasniewski1 in #10985
- [TRTLLM-8263][feat] Add ctx-only and gen-only Disagg Perf Tests by @chenfeiz0326 in #11361
- [TRTLLM-10037][chore] Re-upgrade GHA for blossom-ci workflow by @dpitman-nvda in #11483
- [None][feat] Add support for multi instances in Triton backend with pytorch backend by @achartier in #11153
- [None][fix] Fix silent MPI failures on models with custom tokenizers by @jthomson04 in #11399
- [None][infra] PLC pipeline update by @yuanjingx87 in #11547
- [TRTLLM-10827][feat] Add KV Cache metrics to MetricsCollector for more Prometheus metrics by @yijingl-nvidia in #11243
- [https://nvbugs/5880313][fix] Fix pp + disagg by @Tabrizian in #11509
- [None][infra] Waive unittest that consistently timed out by @yuanjingx87 in #11580
- [TRTLLM-1543][feat] Account for reusable KV cache blocks in capacity … by @SimengLiu-nv in #11490
- [None][feat] Visual Gen: add cuda graphs; torch compile; nvtx; warmup by @NVShreyas in #11554
- [TRTLLM-9040][perf] Make preprocessing async by @2ez4bz in #11459
- [#11440] [feat] AutoDeploy : Support Qwen3.5 by @bmarimuthu-nv in #11394
- [#11292][feat] use smg-grpc-proto package for gRPC proto definitions by @CatherineSue in #11578
- [None][doc] Add Qwen3.5, GLM 4.7 Flash to support matrix by @bmarimuthu-nv in #11594
- [None][feat] AutoDeploy: Add nemotron v2 acc test by @nvchenghaoz in #11429
- [#11569][fix] Fix broken LLMAPI config by @2ez4bz in #11571
- [None][chore] split up TorchSampler.Store by @ixlmar in #11566
- [None][fix] Read mamba_ssm_cache_dtype from HF config when set to auto by @tomeras91 in #11582
- [https://nvbugs/5914959][fix] Fix illegal memory access with Helix CP=64 by @brb-nv in #11593
- [#10243][feat] Add TRT-LLM attention backend to AutoDeploy by @MrGeva in #11430
- [TRTLLM-10857][chore] Move SaveHiddenStates spec dec mode to 1 model by @mikeiovine in #11241
- [TRTLLM-10197][feat] Cache Transfer Setup for Mamba States by @NVShreyas in #10934
- [TRTLLM-11069][fix] validate requests outside sampling loop by @ixlmar in #11584
- [None][fix] correct chunked prefill handling in TorchSampler by @ixlmar in #11544
...
v1.3.0rc4
Highlights

Model Support

API
- Add user-provided UUID support for multimodal KV cache identification (#11075)
Feature
- Support GB200 and increase disagg test timeout (#11019)
- Avoid syncs in beam search and other improvements (#11349)
- Implement disaggregated harmony chat (#11336)
- Support different KV cache layout for one-model spec dec (#10502)
- Reduce attention module repeated warnings (#11335)
- Make update_weights compatible with CUDA Graph (#11267)
- Fully non-blocking pipeline parallelism executor loop (#10349)
- Move MambaCacheManager from Python to C++ (#10540)
- Pin host memory and batch sampler setup in beam search (#11390)
- Initial PR for trtllm-gen attention backend (#10784)
- Remove the hard-coded activation type definition in the TRTLLM MoE backend (#11164)
- Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation (#11406)
- Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
- Add BOLT compatible build flags for further experimental usage (#11297)
- Multi-image support for EPD disagg (#11264)
- Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
- TorchSampler general host time optimization (#11141)
Fix
- Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
- Replace etcd3 with etcd-sdk-python (#10886)
- Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
- Fix hang issue by avoiding exposing UB buf… (#10842)
- WAR for popen in QA env (#10989)
- Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
- Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
- Avoid reserved filename on Windows (#11382)
- Fix tinygemm accuracy (#11411)
- Disable cutedsl argmax kernel to fix perf regression (#11403)
- Fix DeepEPLowLatency with the TRTLLM MoE backend running FP8 DS_R1 (#11266)
- Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
- Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
- Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
- Fix out-of-bounds array access in kernel factory Get() methods (#11373)
- Fix a bug in PR11336 (#11439)
- Fix GLM engine build dtype (#11246)
- Enable warmup for Helix CP (#11460)
- Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
- Make NVML work with older CUDA driver versions (#11465)
- Fallback to triton_ssm for nvfp4 quantization (#11456, #11455)
- Fix CUDA OOM error (#11219)
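One fix in this release (#11261) makes the runtime respect `CUDA_LAUNCH_BLOCKING`, the standard CUDA environment variable that serializes kernel launches so an error surfaces at its true launch site rather than at a later synchronization point. A minimal sketch of setting it for a child process; the child command here is a stand-in, not an actual TRT-LLM invocation.

```python
import os
import subprocess
import sys

# Serialize CUDA kernel launches in a child process so a failing kernel
# is reported at its launch site rather than at a later sync point.
# The child command below is a placeholder for a real workload.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # 1
```

Because launches are serialized, this mode is for debugging only; it noticeably slows down inference.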
Documentation

Benchmark

Test & Infra
- Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
- Fix missing test cases (#10881)
- Update test constraint (#11054)
- Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
- Update model list (#11364)
- Unit test for disagg gen cancellation (#11108)
- Disable spark stages due to migration of spark cloud (#11401)
- Re-enable Spark CI since the Spark cloud migration is done (#11407)
- Upload unittest sub results in slurm (#10834)
- Remove obsolete code (#11388)
- Fix the testcase name in timeout xml (#9781)
- Use frontend dgx-h100 and b200 slurm platforms (#11251)
- Update allowlist 2026-02-10 (#11426)
- Lock FI version to 0.6.3 (#11371)
- Pin the torchao version (#11444)
- Refactor finish reasons tests (#11445)
- Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
- Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
- Use weakref in atexit handler (#11476)
- Improve assert in sampler (#11475)
- Update allowlist 2026-02-13 (#11512)
What's Changed
- [None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
- [None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
- [None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
- [https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
- [None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
- [https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
- [#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
- [TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
- [None][chore] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
- [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
- [TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
- [None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
- [None][chore] update model list by @tcherckez-nvidia in #11364
- [None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
- [https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
- [TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
- [https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
- [None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
- [https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
- [None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
- [None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
- [None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
- [TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
- [#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
- [TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
- [None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
- [TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
- [None][infra] Enable sparck ci since spark cloud migration is done by @EmmaQiaoCh in #11407
- [None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
- [None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
- [None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
- [None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
- [TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
- [https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
- [None][fix] fix tinygemm accuracy by @bo-nv in #11411
- [https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
- [TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
- [https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
- [TRTLLM-10273][feat] Move MambaCa...
v1.3.0rc3
Highlights:
Model Support
- Support LoRA BF16 checkpoints with Llama 3.3-70B FP8 (#9808)
- Add Eagle3 support for Nemotron H (#11131)
- Enhance support for complex models (#11254)
API
- Allow overriding quantization configs (#11062)
- Set continuous_usage_stats default to False to follow OpenAI protocol (#10644)
- Set max_num_tokens_in_buffer default based on max_seq_len/max_input_len (#11082)
Feature
- Export ONNX for DriveOS LLM (#10117)
- Add L2 norm pattern matcher and fusion transform (#10767)
- Add PDL support for moeAlltoAllKernels (#10591)
- Integrate KVCacheManager V2 into TRTLLM runtime (#10659)
- Integrate cuda.tile RMS norm kernels (#9725)
- Refactor request fetching logic for better separation of concerns (#10988)
- Implement gen-first disagg_service (#11020)
- Support disagg SLURM job rescheduling (#11218)
- Improve layer classification for sharding (#10718)
- Add priority-based KV cache offload filtering (#10751)
- Optimize beam search performance (remove GPU sync, fix batching, refactor) (#11276)
- Avoid sync in PyTorchModelEngine when using beam search (#11341)
- Adjust DeepGEMM tuning buckets for larger num_tokens scope (#11259)
- Add CuteDSL FP8 GEMM for Blackwell (#10130)
- Reduce host memory usage during model loading (#11119)
- Perfect routing for Deepseek models (#11127)
- Modularize transceiver for KV manager v2 (step 4) (#11225)
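The priority-based KV cache offload filtering item above (#10751) selects which cache blocks to move off the GPU. A minimal, hypothetical sketch of the idea; the block names and the keep-top-k policy are illustrative, not the actual implementation.

```python
import heapq

# Hypothetical blocks as (priority, block_id); higher priority = keep on GPU.
blocks = [(3, "blk-a"), (9, "blk-b"), (1, "blk-c"), (7, "blk-d")]

def offload_candidates(blocks, keep_top_k):
    # Offload the lowest-priority blocks, keeping the top-k on device.
    return heapq.nsmallest(len(blocks) - keep_top_k, blocks)

print(offload_candidates(blocks, 2))  # [(1, 'blk-c'), (3, 'blk-a')]
```

Filtering by priority before offloading keeps hot blocks (e.g. shared system-prompt prefixes) resident while colder blocks migrate to host memory.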
Fix
- Fix AttributeError with return_perf_metrics on TensorRT backend (#10662)
- Prevent routing context and generation requests to the same worker; document unique disagg ID (#11095)
- Prevent out-of-bounds read (#10868)
- Add __syncthreads() to TinyGEMM to resolve intermittent accuracy issues (#10873)
- Fix PD disaggregation for VLMs that use mrope (#10865)
- Always reset drafting states for GuidedDecoder (#10899)
- Use NCCL as fallback to avoid crash due to insufficient memory (#10928)
- Fix llama sm120 spec decoding (#10765)
- Fix MTP one-model sampler (#10369)
- Align kv_scales with ModelOpt HF checkpoint (#10745)
- Fix selective_state_update perf regression for T=1 decode path (#11194)
- Make health_generate work with beam search (#11097)
- Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
- Fix CuteDSL argmax on sm120 (#11181)
- Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel (#11256)
- Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
- Fix partial reuse disabled for disagg (#11247)
- Retake ownership of mrope tensors in prefill worker (#11217)
- Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
- Fix accuracy drop in VSWA with KV cache block reuse (#10875)
Documentation
- Add Glm4MoeForCausalLM to model support matrix (#11156)
- Fix GLM4-MoE Eagle support documentation (#11198)
- Add CUDA Graph + LoRA to feature combination matrix (#11187)
- Fix comments for KV cache manager v2 (#11207)
- Skip Softmax Attention blog and docs (#10592)
- Add sparse attention docs to index (#11342)
Test & Infra
- Update GB200 test configs to use frontend SLURM platforms (#11085)
- Fix jaraco-context and wheel vulnerability (#10901)
- Add --high-priority in bot help message (#11133)
- Print memory usage before/after accuracy test in CI (#11155)
- Fix mocking of HuggingFace downloads in with_mocked_hf_download (#11200)
- Set rerun report stage UNSTABLE and pipeline SUCCESS when rerun tests pass (#11210)
- Move 6x H100 test stage to AIHub platform (#11039)
- Add disagg perf tests (#10912)
- Provide uniform test framework to test all MoE backends (#11128)
- Move disagg scripts env configs from bash to submit.py (#10223)
- Use free port for serve test (#10878)
- Fix test_auto_scaling for 2 GPUs (#10866)
- Update test list (#10883)
- Fix an invalid test name (#11195)
- Refine QA test list for SM120 (#11248)
- Fix multimodal serve test (#11296)
- Pass without_comm to Cutlass and DeepGEMM (#11229)
- Promote SampleState to TypeVar and fix typing (#11281)
- Fix bench script test (#10483)
What's Changed
- [None][feat] Export ONNX for DriveOS LLM by @nvyocox in #10117
- [#9525][feat] add L2 norm pattern matcher and fusion transform by @karthikvetrivel in #10767
- [TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms by @mlefeb01 in #11085
- [None][doc] Add Glm4MoeForCausalLM to model support matrix by @venkywonka in #11156
- [None][feat] Perfect routing for Deepseek models by @brb-nv in #11127
- [TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super by @nv-guomingz in #10791
- [#8242][feat] Add int4 GPTQ support for AutoDeploy by @Fridah-nv in #8248
- [https://nvbugs/5804683][infra] unwaive Mistral Large3 test by @byshiue in #10680
- [TRTLLM-9771][feat] Allow overriding quantization configs by @shuyixiong in #11062
- [None][ci] Waive a flaky test on A10 by @chzblych in #11163
- [None][infra] Waive failed cases for main on 1/30 by @EmmaQiaoCh in #11142
- [None][fix] AttributeError with return_perf_metrics on tensorrt backend by @riZZZhik in #10662
- [https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID by @reasonsolo in #11095
- [TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns by @lancelly in #10988
- [https://nvbugs/5823284][fix] Unwaive no repro hang issue by @liji-nv in #11138
- [None] [feat] Add PDL support for moeAlltoAllKernels by @kaiyux in #10591
- [None][infra] Waive failed cases and disable a stage on 02/02 by @EmmaQiaoCh in #11177
- [TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime by @yizhang-nv in #10659
- [None][chore] Mass integration of release/1.2 - 2nd by @dominicshanshan in #11088
- [None][feat] Integrate cuda.tile RMS norm kernels by @lirundong in #9725
- [None][test] Fix an invalid test name by @chzblych in #11195
- [None][feat] Nemotron H: Eagle3 support by @IzzyPutterman in #11131
- [#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation by @govind-ramnarayan in #11073
- [None][doc] Fix GLM4-MoE Eagle support documentation by @venkywonka in #11198
- [TRTLLM-10561][infra] Fix jaraco-context and wheel vulnerability by @yiqingy0 in #10901
- [TRTLLM-10307][infra] Add --high-priority in bot help message by @mzweilz in #11133
- [None][chore] Print memory usage before/after accuracy test in CI by @taylor-yb-lee in #11155
- [TRTLLM-10803][fix] Fix mocking of HuggingFace downloads in `with_mocked_hf_download` by @anish-shanbhag in #11200
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11193
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11202
- [TRTLLM-10839][infra] Set rerun report stage UNSTABLE and pipeline SUCCESS in post-merge when there are passed rerun tests by @yiqingy0 in #11210
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11216
- [None][fix] Align kv_scales with modelopt HF checkpoint by @cjluo-nv in #10745
- [https://nvbugs/5739981][fix] unwaive tests using opt-125M by @ixlmar in #11100
- [TRTLLM-10019][infra] Move 6 h100 test stage to aihub platform by @yuanjingx87 in #11039
- [TRTLLM-8921][feat] implement gen-first disagg_service by @reasonsolo in #11020
- [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU by @taylor-yb-lee in #11059
- [None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol by @riZZZhik in #10644
- [None][chore] bump version to 1.3.0rc3 by @tburt-nv in #11238
- [TRTLLM-8263][feat] Add Disagg Perf Tests by @chenfeiz0326 in #10912
- [None][fix] Fix selective_state_update perf regression for T=1 decode path by @galagam in #11194
- [TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends by @xxi-nv in #11128
- [None][fix] make health_generate work with beam search by @ixlmar in https://github.com/NVIDIA/TensorRT...
v1.2.0rc6.post3
What's Changed
- [https://nvbugs/5850094][fix] Fix MoE cost estimation for auto multi-stream scheduling by @yizhang-nv in #11160
- [None][feat] update TRT-LLM Gen DS FP8 MoE cubins and optimize finalize kernel by @nekorobov in #11104
- [None][chore] Bump version to 1.2.0rc6.post3 by @yiqingy0 in #11224
- [None][fix] Fallback to NCCL instead of NCCL symmetric by @Tabrizian in #11174
- [None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE by @nekorobov in #11143
Full Changelog: v1.2.0rc6.post2...v1.2.0rc6.post3
v1.2.0rc2.post2
What's Changed
- [None][fix] fix TinyGemm accuracy issue. cherry-pick #10619 and #10873 by @bo-nv in #10990
- [None][chore] Bump version to 1.2.0rc2.post2 by @chzblych in #11012
- [None][chore] Upgrade starlette and FastAPI (#9319) by @chzblych in #11027
- [None][fix] fix accuracy issue(cherry-pick #11157 and #9530) by @bo-nv in #11222
Full Changelog: v1.2.0rc2.post1...v1.2.0rc2.post2
v1.3.0rc2
Highlights:

Known Issues
- On RTX6000D, one might encounter an `Instruction 'redux.f32' not supported` error. This issue will be resolved in the next release.
Model Support

API

Feature
- Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
- Fuse AllGather for expert statistics required by EPLB (#10885)
- Add first-iteration streaming for GPT-OSS in `trtllm-serve` (#10808)
- Integrate CuteDSL argmax kernel (#10476)
- Update Mamba decode kernel to FlashInfer (#10757)
- Improve effective memory bandwidth with TMA.RED (#10987)
- Reorganize AutoTuner cache file for distributed tuning (#10956)
- Support attention DP + Helix CP (#10477)
- Improve performance of `_write_finish_reasons` in TorchSampler (#10459)
- Add gRPC server for high-performance external router integration (#11037)
- Prepare for future KVCacheV2 MTP support (#11029)
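The AutoTuner cache reorganization above (#10956) targets distributed tuning, where each rank tunes a subset of kernels and the results must be combined. A hedged sketch of one way such a per-rank cache layout can work; the file names and JSON schema below are illustrative, not TensorRT-LLM's actual format.

```python
import json
import os
import tempfile

# Hypothetical per-rank tuner cache layout: each rank writes its own shard,
# and one rank merges all shards into a single cache afterwards. File names
# and the JSON schema are illustrative, not TensorRT-LLM's actual format.
def shard_path(cache_dir, rank):
    return os.path.join(cache_dir, f"tuner_cache.rank{rank}.json")

def merge_shards(cache_dir, world_size):
    merged = {}
    for r in range(world_size):
        with open(shard_path(cache_dir, r)) as f:
            merged.update(json.load(f))
    return merged

with tempfile.TemporaryDirectory() as d:
    # Simulate two ranks each tuning a disjoint set of kernels.
    for r in range(2):
        with open(shard_path(d, r), "w") as f:
            json.dump({f"gemm_{r}": {"tile_m": 128 * (r + 1)}}, f)
    merged = merge_shards(d, 2)

print(sorted(merged))  # ['gemm_0', 'gemm_1']
```

Keeping shards disjoint per rank avoids write contention on a single cache file during tuning.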
Fix
- Fix CuteDSL MoE unit test (#10983)
- Fix overlap scheduler `pause()` timing (#10943)
- Fix Pydantic deepcopy bug (#11004)
- Restore IPv6 support in `serve.py` (#10929)
- Fix conditional compilation for sm10x cubins (#10839)
- Add graceful fallbacks for NCCL symmetric mode (#11042)
- Fix `enable_alltoall` passed to CutlassFusedMoE (#11016)
- Fix kvCacheManager `isLeaf()` assertion failure (#10922)
- Add null pointer check to `parseNpyHeader` (#10944)
- Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
Documentation
- Update Qwen2/3-VL models in `supported_models.md` (#10797)
Benchmark

Test & Infra
- Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
- Add timeout for SeedOSS test (#8683)
- Add Fake Ops for one-sided AlltoAll (#11002)
- Refactor setup for RNN cache transceiver (#10957)
- Change SLURM config access to use `resolvePlatform` (#11006)
- Update CI allowList (#11040)
- Add Mamba and MLA layers to sharding tests (#10364)
- Remove pybind11 bindings and references (#10550, #11026)
- Add multi-acc and Lyris GB200 test support (#11024)
- Package `triton-kernels` as a dependency (#10471)
- Fix Qwen3 Eagle test (#11030)
- Dump thread stacks for hanging tests before timeout (#10708)
- Remove `-ccache` from `build_wheel.py` args (#11064)
- Fix `trtllm-serve` guided decoding test (#11101)
- Remove invalid account for Blossom CI (#11126)
- Add source code pulse scan to PLC nightly pipeline (#10961)
What's Changed
- [None][fix] Fix CuteDSL MoE unittest by @syuoni in #10983
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10974
- [https://nvbugs/5661741][feat] Add 250K-token NVFP4 MoE + PDL regression tests by @yingguo-trt in #10911
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10976
- [None][infra] Waive failed case for main branch on 01/26 by @EmmaQiaoCh in #10994
- [None][feat] Add Skip Softmax MLA kernels for Blackwell and Fix an accuracy bug of NVFP4 KV by @Tom-Zheng in #10813
- [TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. by @bobboli in #10885
- [https://nvbugs/5794796][fix] Cherry-pick #10855: Unwaive Llama 3.3 related multi GPU tests by @pengbowang-nv in #10942
- [#10614][fix] gpt_oss first iteration streaming in trtllm-serve by @LinPoly in #10808
- [None][chore] Removing pybind11 bindings and references by @Linda-Stadter in #10550
- [#8982][feat] AutoDeploy attention dp support by @lucaslie in #10728
- [None][chore] update AD model list by @tcherckez-nvidia in #10981
- [TRTLLM-10062][feat] Enable MTP for Nemotron Super by @sunnyqgg in #10754
- [TRTLLM-10276][feat] Integrate cutedsl argmax kernel by @ameynaik-hub in #10476
- [TRTLLM-10453][feat] Update mamba decode kernel to flashinfer by @Wanli-Jiang in #10757
- [TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler by @yuantailing in #10943
- [https://nvbugs/5612438][fix] Add timeout for SeedOSS test by @zhhuang-nv in #8683
- [None][infra] Waive failed cases for main on 01/27 by @EmmaQiaoCh in #11017
- [None][chore] Bump version to 1.3.0rc2 by @yiqingy0 in #11021
- [None][chore] Remove closed bugs by @xinhe-nv in #10982
- [#10889][fix] fix pydantic deepcopy bug by @reasonsolo in #11004
- [TRTLLM-9390][chore] Add Fake OPs for One-Sided AlltoAll. by @bobboli in #11002
- [TRTLLM-9831][perf] Use TMA.RED to improve effective memory bandwidth by @sherry-1001 in #10987
- [TRTLLM-9527][feat] change context params and disagg params (step3) by @chuangz0 in #10495
- [TRTLLM-10308][feat] AutoTuner Cache: reorganize cache file for distributed tuning by @hyukn in #10956
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10993
- [https://nvbugs/5843316][chore] waive overlap_scheduler test by @galagam in #11025
- [#10013][feat] AutoDeploy: native cache manager integration by @lucaslie in #10635
- [https://nvbugs/5721661][chore] Unwaive fixed bug. by @SimengLiu-nv in #11009
- [#10877][fix] restore ipv6 support in serve.py by @Evgueni-Petrov-aka-espetrov in #10929
- [TRTLLM-10197][chore] Refactor to setup for RNN cache transceiver by @NVShreyas in #10957
- [TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform by @mlefeb01 in #11006
- [None][fix] Proper conditional compilation of sm10x cubins by @tongyuantongyu in #10839
- [https://nvbugs/5756804][fix] Re-enable passing test by @dongfengy in #10986
- [None][fix] unwaive tests by @xinhe-nv in #11047
- [https://nvbugs/5779536][fix] Cherry-pick #10902: Unwaive DeepSeekR1 nvfp4 pp4 mtp test case (#10902) by @pengbowang-nv in #11000
- [None][infra] Update CI allowList by @yuanjingx87 in #11040
- [TRTLLM-10362][feat] Added Mamba and MLA layers to the sharding tests by @greg-kwasniewski1 in #10364
- [None][chore] Removing cpp/tensorrt_llm/pybind by @Linda-Stadter in #11026
- [None][feat] support multi_acc and Lyris GB200 test by @yingguo-trt in #11024
- [None][infra] Waive failed cases for main on 1/28 by @EmmaQiaoCh in #11053
- [None][chore] AutoDeploy: Eagle One-Model [1/n]: PyTorch impl for Eagle3 checkpoint by @govind-ramnarayan in #10674
- [#10245][feat] AutoDeploy: Add Minimax M2 support by @bmarimuthu-nv in #10525
- [None][fix] nccl symmetric with graceful fallbacks by @nv-lschneider in #11042
- [None][fix] fix Qwen2/3 export for AutoDeploy by @Fridah-nv in #11007
- [None][fix] No need to remove the original waive list by @yiqingy0 in #11060
- [https://nvbugs/5761391][fix] Include triton-kernels as a packaged dependency by @anish-shanbhag in #10471
- [None][fix] Fix enable_alltoall passed to CutlassFusedMoE by @syuoni in #11016
- [None][feat] Add performance alignment to layer-wise benchmarks by @yuantailing in #11018
- [https://nvbugs/5813452][fix] Fix "Assertion failed: isLeaf() in kvCacheManager.cpp:465" by @Boreas618 in #10922
- [None][infra] Waived flaky tests by @ZhanruiSunCh in #11091
- [TRTLLM-10264][feat] Support attention DP + Helix CP by @brb-nv in #10477
- [TRTLLM-10415][feat] Dump thread stacks for hanging tests before time… by @WeiHaocheng in #10708
- [TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler by @stnie in https:/...
v1.3.0rc1
Highlights

Model Support

API

Feature
- Update disagg slurm scripts (#10712)
- Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
- Fix sharding dashboard errors (#10786)
- Async Transfer Manager (#9891)
- Speculative One Model: FlashInfer sampling (#10284)
- Refactor speculative decoding workers (#10768)
- Use global unique id as disagg request id (#10187)
- Enable guided decoding with reasoning parsers (#10890)
- Support partial update weight for fp8 (#10456)
- Multi-LoRA serving with CUDA Graph (#8279)
- Support logprobs for Completions API (#10809)
- Eagle3 Specdec UX improvements (#10124)
- Python transceiver components (step 2) (#10494)
- Upgrade NIXL to v0.9.0 (#10896)
- KV Connector Support for MTP (#10932)
- Support overlap scheduler for disagg ctx instances (#10755)
- Adding implementation of KVCacheManagerV2 (#10736)
- Switch to ConfigurableMoE as the default path (#10792)
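The globally unique disagg request id item above (#10187) replaces per-instance ids, which can collide when multiple workers generate them independently. A hedged sketch of one common scheme for globally unique ids; the hostname/pid/UUID composition is an assumption for illustration, not the actual TensorRT-LLM id format.

```python
import os
import socket
import uuid

# Hypothetical sketch of a globally unique request id for disaggregated
# serving: hostname + pid + UUID, so ids cannot collide across processes
# or containers even when hostnames repeat. Not the actual TRT-LLM scheme.
def make_request_id() -> str:
    return f"{socket.gethostname()}-{os.getpid()}-{uuid.uuid4().hex}"

a, b = make_request_id(), make_request_id()
print(a != b)  # True
```

The UUID component alone already makes collisions vanishingly unlikely; the host and pid prefixes mainly help when tracing a request back to its originating worker in logs.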
Fix
- Enable system memory to transfer active message in NIXL ucx (#10602)
- Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A (#10539)
- Default disable gemm+allreduce fusion (#10656)
- Fix vulnerability urllib3 and nbconvert (#10551)
- Fix overlap scheduler race condition (#10610)
- Replace pickle.load with restricted Unpickler (#10622)
- Fix copy start_logs in disagg slurm scripts (#10840)
- Cherry-pick: Disable short profile for tunable ops with MERGE strategy (#10844, #10715)
- Lock resource to fix potential access to released data (#10827)
- Cherry-pick: Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841, #10654)
- Remove weight tensor holder to release memory earlier (#10876)
- Add missing dist strategy param and fix typo for ad_logger (#10892)
- Update RMSNorm custom op plumbing (#10843)
- Fix hmac launch (#10434)
- Avoid double update for previous batch (#9888)
- Re-init TRTLLM sampler to use sample stream in multi-stream cases (#10918)
- Fix MTP with async scheduler (#10941)
- Fix buffer reuse (#10716)
- Cherry-pick: Fix hanging issue for MNNVL Allreduce under PP (#10750, #10633)
- Workaround for flashinfer.sampling.sampling_from_logits (#10713)
- Fix port 8000 being used issue in stress test (#10756)
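The fix "Replace pickle.load with restricted Unpickler (#10622)" follows a standard hardening pattern. A minimal sketch of that pattern is below; the class and the allow-list contents are illustrative assumptions, not the actual TensorRT-LLM implementation:

```python
import io
import pickle

# Only explicitly allow-listed globals may be resolved during unpickling,
# so a payload that tries to resolve an arbitrary callable is rejected
# instead of executed. (Hypothetical allow-list for illustration.)
ALLOWED = {
    ("builtins", "list"),
    ("builtins", "dict"),
    ("builtins", "set"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def restricted_loads(data: bytes):
    # Drop-in replacement for pickle.loads on untrusted bytes.
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers of primitives round-trip unchanged, while any pickle that references a global outside the allow-list raises `UnpicklingError` at load time.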
-
Documentation
-
Test & Infra
- Upload regression info to artifactory (#10599)
- Add sonarqube scanning in lockfile generation pipeline (#10700)
- Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
- Remove trt flow tests in NIM (#10731)
- Update config.yaml of slurm scripts to align with submit.py change (#10802)
- Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
- Trigger multi-gpu tests when install_nixl/ucx.sh is modified (#10624)
- Add DGX-Spark VLM accuracy and perf spec dec cases (#10804)
- Fix test list llm_spark_func.txt (#10921)
- Add test configurable moe module multi gpu (#10699)
- NVFP4 MoE - Move weights transformation to fusion phase (#10803)
- Update flashinfer-python to 0.6.1 (#10872)
- Improve disagg acc tests (#10833)
- Refine placement group in ray executor (#10235)
- Regenerate outdated lock file (#10940)
- Remove long-running sanity check tests on GH200 (#10924, #10969)
- Add dgx-spark beta notes (#10766)
- Modify ctx config in 128k8k disagg cases (#10779)
- Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
What's Changed
- [#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 by @MrGeva in #10697
- [None][infra] Add sonarqube scanning in lockfile generation pipeline by @yuanjingx87 in #10700
- [https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test by @lucaslie in #10461
- [#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size by @MrGeva in #10687
- [#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] by @lucaslie in #10675
- [https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx by @chuangz0 in #10602
- [https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests by @lucaslie in #10769
- [TRTLLM-10300][feat] Upload regression info to artifactory by @chenfeiz0326 in #10599
- [None][chore] Add release/1.2 branch into lockfile generation schedule by @yiqingy0 in #10790
- [TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab by @ZhanruiSunCh in #10616
- [None][infra] Waive failed cases for main on 01/19 by @EmmaQiaoCh in #10794
- [#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test by @MrGeva in #10603
- [None][feat] Update disagg slurm scripts by @qiaoxj07 in #10712
- [None][test] adjust the dis-agg test timeout threshold by @Shixiaowei02 in #10800
- [None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) by @ssam18 in #10320
- [None][chore] Remove trt flow tests in NIM by @jieli-matrix in #10731
- [None][chore] update config.yaml of slurm scripts to align with submit.py change by @dc3671 in #10802
- [https://nvbugs/5776445][chore] unwaive test by @reasonsolo in #10667
- [TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python by @lancelly in #10273
- [TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. by @bobboli in #10539
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10776
- [None][fix] default disable gemm+allreduce fusion by @benzh-2025 in #10656
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10787
- [None][fix] Fix vulnerability urllib3 and nbconvert by @yiqingy0 in #10551
- [None][test] Update sanity test list by @xinhe-nv in #10825
- [None][fix] Remove unused params in attn by @yizhang-nv in #10652
- [TRTLLM-10785][feat] Fix sharding dashboard errors by @greg-kwasniewski1 in #10786
- [https://nvbugs/5701445][chore] unwaive test. by @yuxianq in #10806
- [None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is mod… by @bo-nv in #10624
- [None][infra] Waive failed cases for main branch on 01/20 by @EmmaQiaoCh in #10829
- [None][chore] Reduce tedious logs by @chzblych in #10847
- [#10707][fix] AutoDeploy: Super accuracy test fixes by @galagam in #10717
- [None][chore] Async Transfer Manager by @jthomson04 in #9891
- [None][fix] fix duplicate entry in waives.txt by @lucaslie in #10853
- [None][feat] Speculative One Model: FlashInfer sampling by @IzzyPutterman in #10284
- [https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… by @SimengLiu-nv in #10610
- [https://nvbugs/5760737][test] only skip mooncake+indexerkcache test by @zhengd-nv in #10266
- [https://nvbugs/5759698][fix] unwaive test_base_worker by @Superjomn in #10669
- [None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes by @djns99 in #9532
- [https://nvbugs/5670458][chore] Unwaive reward model test by @shuyixiong in #10831
- [None][chore] Revert #10847 by @chzblych in #10869
- [https://nvbugs/5775021] [fix] Replace pickle.load with restricted Unpickler by @yibinl-nvidia in #10622
- [None][fix] Fix copy start_logs in disagg slurm scripts by @qiaoxj07 in #10840
- [None][fix] Cherry-pick #10715: Disable short profile for tunable ops with MERGE strategy by @hyukn in #10844
- [https://nvbugs/5740377][fix] Lock resource to fix potential access to released data by @HuiGao-NV in #10827
- [https://nvbugs/5814253][fix] unwaive test_autotuner_di...
v1.2.0rc6.post2
What's Changed
- [None][fix] enable EPLB for DEEPGEMM by @xxi-nv in #10618
- [https://nvbugs/5811697][fix] Fix buffer reuse for release/1.2.0rc6.post1 by @yuxianq in #10734
- [None][fix] impl fused triton kernel for e8m0 resmooth (target release/1.2.0rc6.post1, cherry-pick from #10327 and #10770) by @yuxianq in #10771
- [None][chore] Bump version to 1.2.0rc6.post2 by @yiqingy0 in #10907
Full Changelog: v1.2.0rc6.post1...v1.2.0rc6.post2
v1.3.0rc0
Highlights
-
Model Support
-
API Improvements
- Added processed logprobs functionality to TorchSampler (#9675)
- Added support for image_embeds in OpenAI API (#9715)
- Covered LLM API `multi_modal_embeddings` (#9963)
- Implemented GET/DELETE v1/responses/{response_id} endpoints (#9937)
- Use RequestError for validation errors to prevent engine shutdown (#9761)
-
Performance Optimizations
- Added Hopper XQA decode support for skip softmax attention (#10264)
- Enabled attention data parallelism for Nemotron Super v3 (#10347)
- Added fp4 GEMM with AllReduce support (#9729)
- Use XQA JIT implementation by default with sliding window perf optimization (#10335)
- Reduced host overhead for unified nvfp4 GEMM tuning path (#10503)
- Implemented fused Triton kernel for e8m0 resmooth to reduce memory footprint (#10327)
-
MoE (Mixture of Experts) Enhancements
- Added ExpertStatistic and DUMMY_ALLREDUCE for configurable MoE (#10401)
- Added test configurable MoE module (#10575)
- Implemented padding empty chunk for configurable MoE (#10451)
- Enabled EPLB for DEEPGEMM (#10617)
- Extended MoE quantization test utilities with comprehensive quant algorithm support (#10691)
-
Disaggregation Features
-
Auto Deploy
-
Fixes
- Fixed PP loop hang caused by isend-ing new requests (#10665)
- Avoided write-write race for async PP send (#10488)
- Fixed hang issue when enabling skip softmax on Blackwell (#10490)
- Fixed hanging issue for MNNVL Allreduce under PP (#10633)
- Implemented PP skip forward for all spec workers (#10578)
- Added warning for gen-only paused state (#10664)
- Used uint64_t as dtype of lamport_buffer_size to avoid overflow (#10499)
- Fixed HelixCpMnnvlMemory initialization with PP (#10533)
- Fixed regression in KV cache resize memory estimation (#10726)
- Prevented out-of-bounds read (#9879)
- Solved pillow version conflict (#10537)
- Support parsing the `modules_to_not_convert` keyword of the HF model config (#10527)
- Used correct model names for config database regression tests (#10192)
- Support GuidedDecoder with sharded logits (#10698)
- Fixed Piecewise CUDA Graph for GPTOSS (#10631)
- Fixed AutoDeploy EP sharding test (#10460)
- Fixed the nvfp4 fused_moe in AutoDeploy (#10727)
- Added quantization check for DeepEP LL low precision combine in new MoE comm API (#10072)
- Fixed AIPerf issue (#10666)
- Disabled TinyGEMM PDL due to accuracy issues (WAR) (#10619)
- Only keep a limited amount of performance statistics data (#10569)
- Convert to CUDA tensor before calling _resmooth_kernel (#10770)
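Among the fixes above, #10499 widens `lamport_buffer_size` to `uint64_t` because buffer-size arithmetic near 4 GiB wraps in 32 bits. Python integers do not overflow, so the sketch below emulates C-style fixed-width arithmetic with masks to show the failure mode (function and constant names are illustrative, not the actual kernel code):

```python
# Masks emulating C's uint32_t and uint64_t wrap-around semantics.
U32 = (1 << 32) - 1
U64 = (1 << 64) - 1

def buffer_bytes(num_elems: int, elem_size: int, mask: int) -> int:
    # With a 32-bit mask, 2**30 elements of 8 bytes (8 GiB) wraps to 0;
    # with a 64-bit mask the true size is preserved.
    return (num_elems * elem_size) & mask
```

A size of `(1 << 30) * 8` bytes is exactly `2**33`, which is a multiple of `2**32`, so the 32-bit result is 0 — a silently undersized buffer — while the 64-bit result is correct.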
-
Test & Infra
- Added hang detection for executor loop and worker (#10480)
- Implemented bot to send performance regression messages to Slack channel (#10489)
- Made model initialization more general and support weights loading in layer-wise benchmarks (#10562)
- Updated trtllm-gen to support groupsTokensHeadsQ (#10261)
- Added support to export data in trtllm-eval (#10075)
- Added Torch extension API for FusedAddRMSNormQuant kernel (#9905)
- Enabled ray tests (#10272)
- Prevented flaky failures in C++ test_e2e.py by using local cached datasets (#10638)
- Enabled partial reuse in Gemma and GPT OSS test (#10559)
What's Changed
- [TRTLLM-10195][feat] K-EXAONE support by @yechank-nvidia in #10355
- [None][test] update core test list by @crazydemo in #10538
- [#8391][chore] removed llama and added deepseek to AutoDeploy's L0 perf test by @MrGeva in #10585
- [TRTLLM-10022][feat] Add hopper xqa decode support for skip softmax attention by @pengbowang-nv in #10264
- [None][chore] update waive list by @jieli-matrix in #10577
- [None][feat] Add ExpertStatistic and DUMMY_ALLREDUCE for configurable_moe by @qiaoxj07 in #10401
- [TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel by @chenfeiz0326 in #10489
- [None][chore] update deepseekv3.2 test parameter by @yingguo-trt in #10595
- [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10572
- [https://nvbugs/5794796][chore] waive test blocking premerge by @dc3671 in #10593
- [None][fix] Solve pillow version conflict by @Wanli-Jiang in #10537
- [TRTLLM-9522][test] cover LLM API `multi_modal_embeddings` by @ixlmar in #9963
- [None][infra] Waive failed tests for main 01/12 by @EmmaQiaoCh in #10604
- [#10580][fix] re-enable NemotronH MOE MMLU test by @suyoggupta in #10594
- [https://nvbugs/5761391][fix] Use correct model names for config database regression tests by @anish-shanbhag in #10192
- [None][chore] Print correct backend name in benchmark report by @galagam in #10597
- [https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg by @Tabrizian in #10111
- [https://nvbugs/5762336][fix] support to parse the keyword modules_to_not_convert of the HF model config by @xxi-nv in #10527
- [None][chore] Fix disagg assert by @fredricz-20070104 in #10596
- [TRTLLM-10271][test] Add Spark QA functional and performance cases by @JennyLiu-nv in #10564
- [None][infra] try removing shared cache dir mount by @tburt-nv in #10609
- [None][infra] Update allowlist 2026.01.08 by @niukuo in #10535
- [None][feat] Hang detection for executor loop and worker. by @yuxianq in #10480
- [TRTLLM-8462][feat] Support GET/DELETE v1/responses/{response_id} by @JunyiXu-nv in #9937
- [TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. by @nv-guomingz in #10347
- [https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow by @yilin-void in #10499
- [NVBUG-5670458][chore] Unwaive lp tests by @hchings in #10524
- [TRTLLM-8425][doc] document Torch Sampler details by @ixlmar in #10606
- [None][feat] Layer-wise benchmarks: make model init more general and support weights loading by @yuantailing in #10562
- [None][test] Unwaive qwen3 next test case. by @nv-guomingz in #9877
- [None][feat] add fp4 gemm + allreduce by @benzh-2025 in #9729
- [None][infra] support overriding nspect version by @niukuo in #10402
- [https://nvbugs/5772396][fix] WAR: Disable TinyGEMM PDL due to accuracy issues by @dongfengy in #10619
- [None][feat] AutoDeploy: refactor memory usage logging by @nzmora-nvidia in #8505
- [#9283][feat] AutoDeploy: separate rms pattern detection from fusion by @Fridah-nv in #9969
- [https://nvbugs/5791900][fix] Fix HelixCpMnnvlMemory init with PP by @brb-nv in #10533
- [None][chore] Add test configurable moe module by @leslie-fang25 in #10575
- [https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. by @yuxianq in #10578
- [None][fix] Avoid write-write race for async pp send. by @yuxianq in #10488
- [https://nvbugs/5753788][chore] Padding empty chunk for configurable moe by @leslie-fang25 in #10451
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10589
- [None][chore] update allowlist 2026-01-13 by @tburt-nv in #10645
- [None][test] add test into qa test list by @xinhe-nv in #10627
- [None][test] Spark - Change testlist name and perf yml format by @JennyLiu-nv in #10626
- [None][chore] waive the CI failure by @xxi-nv in #10655
- [None][refactor] Unify the usage of MPIDist and TorchDist. by @yuxianq in #10380
- [None][fix] Reduce host over...
v1.2.0rc8
Highlights
-
Model Support
- Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
- Eagle: qwen2 capture hidden states (#10091)
- Add pp support for DeepSeek-v3.2 (#10449)
- Pass lora_params through Qwen2/3 model forward (#10174)
- Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
- Minor code refinements for Mistral Large 3 (#10405)
- EPD for Qwen3 VL (#10470)
- Remove some model support; add device constraint (#10563)
- Enable AttentionDP on Qwen3-VL and fix test (#10435)
-
API
- Add stability tags for serve subcommand (#10012)
-
Feature
- Better align MLA chunking with indexer chunking when chunked prefill is enabled for DSV32 (#10552)
- Sm100 weight-only kernel (#10190)
- AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
- Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
- Add transferAgent binding (step 1) (#10113)
- Add the EOS tokens from the generation config to the sampler's stop words (#10389)
- Apply fusion for W4AFP8_AWQ MoE (#9838)
- Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
- Run sample_async on extra stream (#10215)
- Optimize qk rope/nope concat for DSA (#10571)
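Folding the generation config's EOS token ids into the sampler's stop words, as described for #10389, amounts to a de-duplicating merge that preserves the user's own stop ids. The helper below is a hypothetical sketch, not the TensorRT-LLM API:

```python
def merge_eos_into_stop_words(stop_ids, eos_token_ids):
    # Keep user-provided stop ids in order, then append any EOS ids
    # from the generation config that are not already present.
    merged = list(stop_ids)
    seen = set(stop_ids)
    for eos in eos_token_ids:
        if eos not in seen:
            merged.append(eos)
            seen.add(eos)
    return merged
```

With this merge, generation stops on the model's EOS tokens even when the caller supplied only custom stop sequences.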
-
Fix
- Fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394)
- Use port 0 as an arbitrary port when disagg service discovery is enabled (#10383)
- Fix buffer reuse for CUDA graph attention metadata (#10393)
- Force release torch memory when LLM is destroyed (#10314)
- Swap TP-CP grouping order (#10350)
- TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
- Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
- Fixed recursive node traversals (#10379)
- Fix undefined tokens_per_block (#10438)
- Skip spec dec for non-last rank (#10445)
- Set up dist before using autotuner (#10491)
- Fix broken cast (#9975)
- Fix sm120 speculation (#10049)
- Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873)
- Choose register model config over root config for VLM (#10553)
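The "port 0" technique behind #10383 (and the random-port test fixes below) relies on the OS assigning a free ephemeral port when a socket binds to port 0, which avoids collisions and retry loops. A minimal sketch, with a hypothetical helper name:

```python
import socket

def pick_free_port(host: str = "127.0.0.1") -> int:
    # Binding to port 0 asks the kernel for any free ephemeral port;
    # getsockname() reveals which one was assigned.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
```

Note there is still a small race between closing this probe socket and the real server binding the port, which is why service-discovery setups prefer to bind port 0 directly on the serving socket.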
-
Documentation
- Update SWA + spec dec support matrix (#10421)
- Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
- Adding parallelism types in feature combination matrix (#9849)
- Update GPTOSS Doc (#10536)
- Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
- Update Qwen3-Next doc by adding known issues section (#10582)
-
Test & Infra
- Add tests for DeepSeek v3.2 (#10561)
- Add accuracy tests for super-v3 with multiple-gpus (#10234)
- Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
- Add disag-serving kimi k2 thinking tests (#10357)
- Partition test_llm_pytorch.py for parallel execution (#10400)
- Only use throughput metrics to check regression (#10404)
- Add vswa test cases coverage (#10146)
- Use random port in container port section (#10432)
- Remove redundant retries while binding to arbitrary port (#10452)
- Add qwen3-4b accuracy test case (#10382)
- Update kimi-k2-1k1k dataset (#10473)
- Fix concurrency list in Wide-EP perf tests (#10529)
- Restrict max_num_tokens in disagg mtp config (#10442)
- Add kimi_k2 single node perf test (#10436)
- Add MMMU test for mistral small (#10530)
- Workaround OCI-NRT slowdown issue (#10587)
What's Changed
- [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
- [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
- [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
- [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
- [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
- [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
- [None][feat] sm100 weight-only kernel by @Njuapp in #10190
- [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
- [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
- [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
- [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
- [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
- [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
- [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
- [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
- [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
- [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
- [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
- [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
- [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
- [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
- [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
- [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
- [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
- [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
- [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
- [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
- [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
- [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
- [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
- [None][feat] precompiled installation from local src dir by @lucaslie in #10419
- [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
- [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
- [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389
- [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
- [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
- [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
- [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
- [None][docs] Add `--config` preference over `--extra_llm_api_options` in CODING_GUIDELINES.md by @venkywonka in #10426
- [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
- [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
- [None][test] update test case constraint by @crazydemo in #10381
- [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
- [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
- [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
- [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
- [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz...