Releases: NVIDIA/TensorRT-LLM
v1.3.0rc5
Highlights

Model Support

API

Feature
- Add cache transfer setup for Mamba states (#10934)
- Optimize MoE export by tracing with reduced experts and expanding graph (#11504)
- Add new Helix kernels for MNNVL-based codepath (#11433)
- Add `line_profiler` tool for host overhead analysis (#11232)
- Enable multi-stream MoE; add multi-stream MLA attention (#11520)
- Add MoE all-to-all paradigm (#10985)
- Add support for multiple instances in the Triton backend with the PyTorch backend (#11153)
- Add KV cache metrics to `MetricsCollector` for more Prometheus metrics (#11243)
- Account for reusable KV cache blocks in capacity calculation (#11490)
- Add CUDA graphs, torch compile, NVTX, and warmup for Visual Gen (#11554)
- Make preprocessing async (#11459)
- Split up `TorchSampler.Store` (#11566)
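One item in this list (#11243) adds KV cache metrics to `MetricsCollector` for Prometheus. As a hedged illustration of what such collectors emit, here is a minimal sketch of the Prometheus text exposition format; the metric and label names below are hypothetical, not the actual names TensorRT-LLM exports.

```python
# Hypothetical sketch: render a KV-cache gauge in Prometheus text
# exposition format. Metric and label names are illustrative only,
# not the actual names exported by TensorRT-LLM's MetricsCollector.
def render_metric(name: str, value: float, labels: dict) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric("kv_cache_free_blocks", 1024, {"model": "demo"})
print(line)  # kv_cache_free_blocks{model="demo"} 1024
```

A scraper pointed at the server's metrics endpoint would ingest lines in exactly this shape.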
Fix
- Fix multimodal placeholder counts (#11461)
- Add `cacheSaltID` property to `BlockKey` serialization (#11457)
- Fix cache transceiver (#11409)
- Declare the variable in the correct scope (#11066)
- Fix spec-dec mode flag and related C++ requirements (#10996)
- Fix Qwen3-VL-Dense/MoE accuracy drop (#11134)
- Complete WAR for `popen` in QA env (#11214)
- Improve error message for mismatched MPI world size (#11294)
- Use the `torch_dtype` set by ModelOpt (#11525)
- Fix silent MPI failures on models with custom tokenizers (#11399)
- Fix Nemotron issues (#11425)
- Fix pipeline parallelism + disaggregated serving (#11509)
- Fix broken LLMAPI config (#11571)
- Fix illegal memory access with Helix CP=64 (#11593)
- Validate requests outside sampling loop (#11584)
- Correct chunked prefill handling in `TorchSampler` (#11544)
- Fix SpecDec sampling seed (#11081)
- Prevent NIXL agent name collision in containerized disaggregated serving (#11552)
Documentation
- Add doc for TRTLLM AIGV initial release (#11489)
- Update hardware support (#10719)
- Add documentation on configuring CPU affinity in TRT-LLM (#10678)
- Add warning about 2-model MTP deprecation (#11043)
- Update media file paths in Skip Softmax blog (#11540)
- Update TAVA architecture diagrams for visual gen flow and auto deploy flow (#11523)
- Add Qwen3.5 and GLM 4.7 Flash to support matrix (#11594)
Benchmark
- Add ctx-only and gen-only disaggregated perf tests (#11361)
Test & Infra
- Add CUTEDSL MoE backend for DeepSeek R1 NVFP4 checkpoint in stress test (#10920)
- Update MIG tests (#11014)
- Fix Slurm job name (#11265)
- Ensure `TorchSampler` does not sync (#11508)
- Revert the MoE unit tests refactor that added the unified ConfigurableMoE test framework (#11532)
- Re-upgrade GHA for blossom-ci workflow (#11483)
- Stop using remotes in the Conan install build step (#11516)
- Update PLC pipeline (#11547, #11597)
- Fix testdb file for `l0_b200_multi_gpus_perf_sanity` (#11603)
- Add `visual_gen` CODEOWNERS paths (#11606)
What's Changed
- [None][chore] Adjust waive to avoid sm parsing by @tburt-nv in #11518
- [None][chore] Optimize MOE export by tracing with reduced experts and expanding graph by @suyoggupta in #11504
- [#11170][fix] Fix for mm placeholder counts by @2ez4bz in #11461
- [None][feat] Add new helix kernels for MNNVL-based codepath by @brb-nv in #11433
- [TRTLLM-11016][fix] Add cacheSaltID property to BlockKey serialization code by @thorjohnsen in #11457
- [https://nvbugs/5880261][fix] fix cacheTransceiver by @chuangz0 in #11409
- [None][doc] Add doc for TRTLLM AIGV initial release by @chang-l in #11489
- [TRTLLM-10851][feat] Add line_profiler tool for host overhead analysis. by @hyukn in #11232
- [None][chore] Mass integration of release/1.2 - 4th by @dominicshanshan in #11500
- [None][feat] Use new index api, add block scale support, fix max_seq_len esitmation, add flash mla support by @yizhang-nv in #11334
- [#11455][bug] Use the torch_dtype set by ModelOpt by @tcherckez-nvidia in #11525
- [#10345][perf] Enable multi-stream MOE for super. Also adds multi-stream MLA attn by @suyoggupta in #11520
- [TRTLLM-10030][test] ensure that TorchSampler does not sync by @ixlmar in #11508
- [None][revert] - Revert "[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework" by @chzblych in #11532
- [None][fix] Better error message for mismatched MPI world size by @jthomson04 in #11294
- [#11109][feat] AutoDeploy: GLM 4.7 Flash Improvements by @bmarimuthu-nv in #11414
- [None][doc] Update media files path in Skip Softmax blog. by @bobboli in #11540
- [#11318][infra] AutoDeploy: Add fused rope kernel - triton_rope_on_interleaved_qk_inputs by @bmarimuthu-nv in #11327
- [None][chore] Waive failing pre-merge test by @brb-nv in #11551
- [None][chore] Waive moe fp4 test by @brb-nv in #11558
- [None][chore] Bump version to 1.3.0rc5 by @yuanjingx87 in #11557
- [TRTLLM-10845][feat] Add dynamic llmapi defaults system by @venkywonka in #11035
- [https://nvbugs/5888464][fix] Stop using remotes in the Conan install build step by @tburt-nv in #11516
- [None][chore] TAVA architecture diagram updates for visual gen flow and auto deploy flow by @yibinl-nvidia in #11523
- [TRTLLM-10064][feat] MoE all-to-all paradigm by @greg-kwasniewski1 in #10985
- [TRTLLM-8263][feat] Add ctx-only and gen-only Disagg Perf Tests by @chenfeiz0326 in #11361
- [TRTLLM-10037][chore] Re-upgrade GHA for blossom-ci workflow by @dpitman-nvda in #11483
- [None][feat] Add support for multi instances in Triton backend with pytorch backend by @achartier in #11153
- [None][fix] Fix silent MPI failures on models with custom tokenizers by @jthomson04 in #11399
- [None][infra] PLC pipeline update by @yuanjingx87 in #11547
- [TRTLLM-10827][feat] Add KV Cache metrics to MetricsCollector for more Prometheus metrics by @yijingl-nvidia in #11243
- [https://nvbugs/5880313][fix] Fix pp + disagg by @Tabrizian in #11509
- [None][infra] Waive unittest that consistently timed out by @yuanjingx87 in #11580
- [TRTLLM-1543][feat] Account for reusable KV cache blocks in capacity … by @SimengLiu-nv in #11490
- [None][feat] Visual Gen: add cuda graphs; torch compile; nvtx; warmup by @NVShreyas in #11554
- [TRTLLM-9040][perf] Make preprocessing async by @2ez4bz in #11459
- [#11440] [feat] AutoDeploy : Support Qwen3.5 by @bmarimuthu-nv in #11394
- [#11292][feat] use smg-grpc-proto package for gRPC proto definitions by @CatherineSue in #11578
- [None][doc] Add Qwen3.5, GLM 4.7 Flash to support matrix by @bmarimuthu-nv in #11594
- [None][feat] AutoDeploy: Add nemotron v2 acc test by @nvchenghaoz in #11429
- [#11569][fix] Fix broken LLMAPI config by @2ez4bz in #11571
- [None][chore] split up TorchSampler.Store by @ixlmar in #11566
- [None][fix] Read mamba_ssm_cache_dtype from HF config when set to auto by @tomeras91 in #11582
- [https://nvbugs/5914959][fix] Fix illegal memory access with Helix CP=64 by @brb-nv in #11593
- [#10243][feat] Add TRT-LLM attention backend to AutoDeploy by @MrGeva in #11430
- [TRTLLM-10857][chore] Move SaveHiddenStates spec dec mode to 1 model by @mikeiovine in #11241
- [TRTLLM-10197][feat] Cache Transfer Setup for Mamba States by @NVShreyas in #10934
- [TRTLLM-11069][fix] validate requests outside sampling loop by @ixlmar in #11584
- [None][fix] correct chunked prefill handling in TorchSampler by @ixlmar in #11544
...
v1.3.0rc4
Highlights

Model Support

API
- Add user-provided UUID support for multimodal KV cache identification (#11075)
Feature
- Support GB200 and increase disagg test timeout (#11019)
- Avoid syncs in beam search and other improvements (#11349)
- Implement disaggregated harmony chat (#11336)
- Support different KV cache layout for one-model spec dec (#10502)
- Reduce attention module repeated warnings (#11335)
- Make update_weights compatible with CUDA Graph (#11267)
- Fully non-blocking pipeline parallelism executor loop (#10349)
- Move MambaCacheManager from Python to C++ (#10540)
- Pin host memory and batch sampler setup in beam search (#11390)
- Initial PR for trtllm-gen attention backend (#10784)
- Remove the hard-coded activation type definition in the TRTLLM MoE backend (#11164)
- Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation (#11406)
- Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
- Add BOLT compatible build flags for further experimental usage (#11297)
- Multi-image support for EPD disagg (#11264)
- Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
- TorchSampler general host time optimization (#11141)
Fix
- Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
- Replace etcd3 with etcd-sdk-python (#10886)
- Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
- Fix hang issue by avoiding exposing UB buf… (#10842)
- WAR for popen in QA env (#10989)
- Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
- Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
- Avoid reserved filename on Windows (#11382)
- Fix tinygemm accuracy (#11411)
- Disable cutedsl argmax kernel to fix perf regression (#11403)
- Fix DeepEPLowLatency with the TRTLLM MoE backend running FP8 DS_R1 (#11266)
- Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
- Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
- Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
- Fix out-of-bounds array access in kernel factory Get() methods (#11373)
- Fix a bug in PR11336 (#11439)
- Fix GLM engine build dtype (#11246)
- Enable warmup for Helix CP (#11460)
- Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
- Make NVML work with older CUDA driver versions (#11465)
- Fallback to triton_ssm for nvfp4 quantization (#11456, #11455)
- Fix CUDA OOM error (#11219)
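One fix in this release (#11261) makes the runtime respect `CUDA_LAUNCH_BLOCKING`, the standard CUDA environment variable that serializes kernel launches so an error surfaces at its true launch site rather than at a later synchronization point. A minimal sketch of setting it for a child process; the child command here is a stand-in, not an actual TRT-LLM invocation.

```python
import os
import subprocess
import sys

# Serialize CUDA kernel launches in a child process so a failing kernel
# is reported at its launch site rather than at a later sync point.
# The child command below is a placeholder for a real workload.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # 1
```

Because launches are serialized, this mode is for debugging only; it noticeably slows down inference.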
Documentation

Benchmark

Test & Infra
- Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
- Fix missing test cases (#10881)
- Update test constraint (#11054)
- Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
- Update model list (#11364)
- Unit test for disagg gen cancellation (#11108)
- Disable spark stages due to migration of spark cloud (#11401)
- Re-enable Spark CI since the Spark cloud migration is done (#11407)
- Upload unittest sub results in slurm (#10834)
- Remove obsolete code (#11388)
- Fix the testcase name in timeout xml (#9781)
- Use frontend dgx-h100 and b200 slurm platforms (#11251)
- Update allowlist 2026-02-10 (#11426)
- Lock FI version to 0.6.3 (#11371)
- Pin the torchao version (#11444)
- Refactor finish reasons tests (#11445)
- Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
- Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
- Use weakref in atexit handler (#11476)
- Improve assert in sampler (#11475)
- Update allowlist 2026-02-13 (#11512)
What's Changed
- [None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
- [None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
- [None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
- [https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
- [None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
- [https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
- [#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
- [TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
- [None][chore] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
- [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
- [TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
- [None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
- [None][chore] update model list by @tcherckez-nvidia in #11364
- [None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
- [https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
- [TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
- [https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
- [None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
- [https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
- [None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
- [None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
- [None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
- [TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
- [#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
- [TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
- [None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
- [TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
- [None][infra] Enable sparck ci since spark cloud migration is done by @EmmaQiaoCh in #11407
- [None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
- [None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
- [None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
- [None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
- [TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
- [https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
- [None][fix] fix tinygemm accuracy by @bo-nv in #11411
- [https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
- [TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
- [https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
- [TRTLLM-10273][feat] Move MambaCa...
v1.3.0rc3
Highlights:
Model Support
- Support LoRA BF16 checkpoints with Llama 3.3-70B FP8 (#9808)
- Add Eagle3 support for Nemotron H (#11131)
- Enhance support for complex models (#11254)
API
- Allow overriding quantization configs (#11062)
- Set continuous_usage_stats default to False to follow OpenAI protocol (#10644)
- Set max_num_tokens_in_buffer default based on max_seq_len/max_input_len (#11082)
Feature
- Export ONNX for DriveOS LLM (#10117)
- Add L2 norm pattern matcher and fusion transform (#10767)
- Add PDL support for moeAlltoAllKernels (#10591)
- Integrate KVCacheManager V2 into TRTLLM runtime (#10659)
- Integrate cuda.tile RMS norm kernels (#9725)
- Refactor request fetching logic for better separation of concerns (#10988)
- Implement gen-first disagg_service (#11020)
- Support disagg SLURM job rescheduling (#11218)
- Improve layer classification for sharding (#10718)
- Add priority-based KV cache offload filtering (#10751)
- Optimize beam search performance (remove GPU sync, fix batching, refactor) (#11276)
- Avoid sync in PyTorchModelEngine when using beam search (#11341)
- Adjust DeepGEMM tuning buckets for larger num_tokens scope (#11259)
- Add CuteDSL FP8 GEMM for Blackwell (#10130)
- Reduce host memory usage during model loading (#11119)
- Perfect routing for Deepseek models (#11127)
- Modularize transceiver for KV manager v2 (step 4) (#11225)
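The priority-based KV cache offload filtering item above (#10751) selects which cache blocks to move off the GPU. A minimal, hypothetical sketch of the idea; the block names and the keep-top-k policy are illustrative, not the actual implementation.

```python
import heapq

# Hypothetical blocks as (priority, block_id); higher priority = keep on GPU.
blocks = [(3, "blk-a"), (9, "blk-b"), (1, "blk-c"), (7, "blk-d")]

def offload_candidates(blocks, keep_top_k):
    # Offload the lowest-priority blocks, keeping the top-k on device.
    return heapq.nsmallest(len(blocks) - keep_top_k, blocks)

print(offload_candidates(blocks, 2))  # [(1, 'blk-c'), (3, 'blk-a')]
```

Filtering by priority before offloading keeps hot blocks (e.g. shared system-prompt prefixes) resident while colder blocks migrate to host memory.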
Fix
- Fix AttributeError with return_perf_metrics on TensorRT backend (#10662)
- Prevent routing context and generation requests to the same worker; document unique disagg ID (#11095)
- Prevent out-of-bounds read (#10868)
- Add __syncthreads() to TinyGEMM to resolve intermittent accuracy issues (#10873)
- Fix PD disaggregation for VLMs that use mrope (#10865)
- Always reset drafting states for GuidedDecoder (#10899)
- Use NCCL as fallback to avoid crash due to insufficient memory (#10928)
- Fix llama sm120 spec decoding (#10765)
- Fix MTP one-model sampler (#10369)
- Align kv_scales with ModelOpt HF checkpoint (#10745)
- Fix selective_state_update perf regression for T=1 decode path (#11194)
- Make health_generate work with beam search (#11097)
- Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
- Fix CuteDSL argmax on sm120 (#11181)
- Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel (#11256)
- Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
- Fix partial reuse disabled for disagg (#11247)
- Retake ownership of mrope tensors in prefill worker (#11217)
- Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
- Fix accuracy drop in VSWA with KV cache block reuse (#10875)
Documentation
- Add Glm4MoeForCausalLM to model support matrix (#11156)
- Fix GLM4-MoE Eagle support documentation (#11198)
- Add CUDA Graph + LoRA to feature combination matrix (#11187)
- Fix comments for KV cache manager v2 (#11207)
- Skip Softmax Attention blog and docs (#10592)
- Add sparse attention docs to index (#11342)
Test & Infra
- Update GB200 test configs to use frontend SLURM platforms (#11085)
- Fix jaraco-context and wheel vulnerability (#10901)
- Add --high-priority in bot help message (#11133)
- Print memory usage before/after accuracy test in CI (#11155)
- Fix mocking of HuggingFace downloads in with_mocked_hf_download (#11200)
- Set rerun report stage UNSTABLE and pipeline SUCCESS when rerun tests pass (#11210)
- Move 6x H100 test stage to AIHub platform (#11039)
- Add disagg perf tests (#10912)
- Provide uniform test framework to test all MoE backends (#11128)
- Move disagg scripts env configs from bash to submit.py (#10223)
- Use free port for serve test (#10878)
- Fix test_auto_scaling for 2 GPUs (#10866)
- Update test list (#10883)
- Fix an invalid test name (#11195)
- Refine QA test list for SM120 (#11248)
- Fix multimodal serve test (#11296)
- Pass without_comm to Cutlass and DeepGEMM (#11229)
- Promote SampleState to TypeVar and fix typing (#11281)
- Fix bench script test (#10483)
What's Changed
- [None][feat] Export ONNX for DriveOS LLM by @nvyocox in #10117
- [#9525][feat] add L2 norm pattern matcher and fusion transform by @karthikvetrivel in #10767
- [TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms by @mlefeb01 in #11085
- [None][doc] Add Glm4MoeForCausalLM to model support matrix by @venkywonka in #11156
- [None][feat] Perfect routing for Deepseek models by @brb-nv in #11127
- [TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super by @nv-guomingz in #10791
- [#8242][feat] Add int4 GPTQ support for AutoDeploy by @Fridah-nv in #8248
- [https://nvbugs/5804683][infra] unwaive Mistral Large3 test by @byshiue in #10680
- [TRTLLM-9771][feat] Allow overriding quantization configs by @shuyixiong in #11062
- [None][ci] Waive a flaky test on A10 by @chzblych in #11163
- [None][infra] Waive failed cases for main on 1/30 by @EmmaQiaoCh in #11142
- [None][fix] AttributeError with return_perf_metrics on tensorrt backend by @riZZZhik in #10662
- [https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID by @reasonsolo in #11095
- [TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns by @lancelly in #10988
- [https://nvbugs/5823284][fix] Unwaive no repro hang issue by @liji-nv in #11138
- [None] [feat] Add PDL support for moeAlltoAllKernels by @kaiyux in #10591
- [None][infra] Waive failed cases and disable a stage on 02/02 by @EmmaQiaoCh in #11177
- [TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime by @yizhang-nv in #10659
- [None][chore] Mass integration of release/1.2 - 2nd by @dominicshanshan in #11088
- [None][feat] Integrate cuda.tile RMS norm kernels by @lirundong in #9725
- [None][test] Fix an invalid test name by @chzblych in #11195
- [None][feat] Nemotron H: Eagle3 support by @IzzyPutterman in #11131
- [#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation by @govind-ramnarayan in #11073
- [None][doc] Fix GLM4-MoE Eagle support documentation by @venkywonka in #11198
- [TRTLLM-10561][infra] Fix jaraco-context and wheel vulnerability by @yiqingy0 in #10901
- [TRTLLM-10307][infra] Add --high-priority in bot help message by @mzweilz in #11133
- [None][chore] Print memory usage before/after accuracy test in CI by @taylor-yb-lee in #11155
- [TRTLLM-10803][fix] Fix mocking of HuggingFace downloads in `with_mocked_hf_download` by @anish-shanbhag in #11200
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11193
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11202
- [TRTLLM-10839][infra] Set rerun report stage UNSTABLE and pipeline SUCCESS in post-merge when there are passed rerun tests by @yiqingy0 in #11210
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11216
- [None][fix] Align kv_scales with modelopt HF checkpoint by @cjluo-nv in #10745
- [https://nvbugs/5739981][fix] unwaive tests using opt-125M by @ixlmar in #11100
- [TRTLLM-10019][infra] Move 6 h100 test stage to aihub platform by @yuanjingx87 in #11039
- [TRTLLM-8921][feat] implement gen-first disagg_service by @reasonsolo in #11020
- [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU by @taylor-yb-lee in #11059
- [None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol by @riZZZhik in #10644
- [None][chore] bump version to 1.3.0rc3 by @tburt-nv in #11238
- [TRTLLM-8263][feat] Add Disagg Perf Tests by @chenfeiz0326 in #10912
- [None][fix] Fix selective_state_update perf regression for T=1 decode path by @galagam in #11194
- [TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends by @xxi-nv in #11128
- [None][fix] make health_generate work with beam search by @ixlmar in https://github.com/NVIDIA/TensorRT...
v1.2.0rc6.post3
What's Changed
- [https://nvbugs/5850094][fix] Fix MoE cost estimation for auto multi-stream scheduling by @yizhang-nv in #11160
- [None][feat] update TRT-LLM Gen DS FP8 MoE cubins and optimize finalize kernel by @nekorobov in #11104
- [None][chore] Bump version to 1.2.0rc6.post3 by @yiqingy0 in #11224
- [None][fix] Fallback to NCCL instead of NCCL symmetric by @Tabrizian in #11174
- [None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE by @nekorobov in #11143
Full Changelog: v1.2.0rc6.post2...v1.2.0rc6.post3
v1.2.0rc2.post2
What's Changed
- [None][fix] fix TinyGemm accuracy issue. cherry-pick #10619 and #10873 by @bo-nv in #10990
- [None][chore] Bump version to 1.2.0rc2.post2 by @chzblych in #11012
- [None][chore] Upgrade starlette and FastAPI (#9319) by @chzblych in #11027
- [None][fix] fix accuracy issue(cherry-pick #11157 and #9530) by @bo-nv in #11222
Full Changelog: v1.2.0rc2.post1...v1.2.0rc2.post2
v1.3.0rc2
Highlights:

Known Issues
- On RTX6000D, one might encounter an `Instruction 'redux.f32' not supported` error. This issue will be resolved in the next release.
Model Support

API

Feature
- Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
- Fuse AllGather for expert statistics required by EPLB (#10885)
- Add first-iteration streaming for GPT-OSS in `trtllm-serve` (#10808)
- Integrate CuteDSL argmax kernel (#10476)
- Update Mamba decode kernel to FlashInfer (#10757)
- Improve effective memory bandwidth with TMA.RED (#10987)
- Reorganize AutoTuner cache file for distributed tuning (#10956)
- Support attention DP + Helix CP (#10477)
- Improve performance of `_write_finish_reasons` in TorchSampler (#10459)
- Add gRPC server for high-performance external router integration (#11037)
- Prepare for future KVCacheV2 MTP support (#11029)
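The AutoTuner cache reorganization above (#10956) targets distributed tuning, where each rank tunes a subset of kernels and the results must be combined. A hedged sketch of one way such a per-rank cache layout can work; the file names and JSON schema below are illustrative, not TensorRT-LLM's actual format.

```python
import json
import os
import tempfile

# Hypothetical per-rank tuner cache layout: each rank writes its own shard,
# and one rank merges all shards into a single cache afterwards. File names
# and the JSON schema are illustrative, not TensorRT-LLM's actual format.
def shard_path(cache_dir, rank):
    return os.path.join(cache_dir, f"tuner_cache.rank{rank}.json")

def merge_shards(cache_dir, world_size):
    merged = {}
    for r in range(world_size):
        with open(shard_path(cache_dir, r)) as f:
            merged.update(json.load(f))
    return merged

with tempfile.TemporaryDirectory() as d:
    # Simulate two ranks each tuning a disjoint set of kernels.
    for r in range(2):
        with open(shard_path(d, r), "w") as f:
            json.dump({f"gemm_{r}": {"tile_m": 128 * (r + 1)}}, f)
    merged = merge_shards(d, 2)

print(sorted(merged))  # ['gemm_0', 'gemm_1']
```

Keeping shards disjoint per rank avoids write contention on a single cache file during tuning.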
Fix
- Fix CuteDSL MoE unit test (#10983)
- Fix overlap scheduler `pause()` timing (#10943)
- Fix Pydantic deepcopy bug (#11004)
- Restore IPv6 support in `serve.py` (#10929)
- Fix conditional compilation for sm10x cubins (#10839)
- Add graceful fallbacks for NCCL symmetric mode (#11042)
- Fix `enable_alltoall` passed to CutlassFusedMoE (#11016)
- Fix kvCacheManager `isLeaf()` assertion failure (#10922)
- Add null pointer check to `parseNpyHeader` (#10944)
- Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
Documentation
- Update Qwen2/3-VL models in `supported_models.md` (#10797)
Benchmark

Test & Infra
- Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
- Add timeout for SeedOSS test (#8683)
- Add Fake Ops for one-sided AlltoAll (#11002)
- Refactor setup for RNN cache transceiver (#10957)
- Change SLURM config access to use `resolvePlatform` (#11006)
- Update CI allowList (#11040)
- Add Mamba and MLA layers to sharding tests (#10364)
- Remove pybind11 bindings and references (#10550, #11026)
- Add multi-acc and Lyris GB200 test support (#11024)
- Package `triton-kernels` as a dependency (#10471)
- Fix Qwen3 Eagle test (#11030)
- Dump thread stacks for hanging tests before timeout (#10708)
- Remove `-ccache` from `build_wheel.py` args (#11064)
- Fix `trtllm-serve` guided decoding test (#11101)
- Remove invalid account for Blossom CI (#11126)
- Add source code pulse scan to PLC nightly pipeline (#10961)
What's Changed
- [None][fix] Fix CuteDSL MoE unittest by @syuoni in #10983
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10974
- [https://nvbugs/5661741][feat] Add 250K-token NVFP4 MoE + PDL regression tests by @yingguo-trt in #10911
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10976
- [None][infra] Waive failed case for main branch on 01/26 by @EmmaQiaoCh in #10994
- [None][feat] Add Skip Softmax MLA kernels for Blackwell and Fix an accuracy bug of NVFP4 KV by @Tom-Zheng in #10813
- [TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. by @bobboli in #10885
- [https://nvbugs/5794796][fix] Cherry-pick #10855: Unwaive Llama 3.3 related multi GPU tests by @pengbowang-nv in #10942
- [#10614][fix] gpt_oss first iteration streaming in trtllm-serve by @LinPoly in #10808
- [None][chore] Removing pybind11 bindings and references by @Linda-Stadter in #10550
- [#8982][feat] AutoDeploy attention dp support by @lucaslie in #10728
- [None][chore] update AD model list by @tcherckez-nvidia in #10981
- [TRTLLM-10062][feat] Enable MTP for Nemotron Super by @sunnyqgg in #10754
- [TRTLLM-10276][feat] Integrate cutedsl argmax kernel by @ameynaik-hub in #10476
- [TRTLLM-10453][feat] Update mamba decode kernel to flashinfer by @Wanli-Jiang in #10757
- [TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler by @yuantailing in #10943
- [https://nvbugs/5612438][fix] Add timeout for SeedOSS test by @zhhuang-nv in #8683
- [None][infra] Waive failed cases for main on 01/27 by @EmmaQiaoCh in #11017
- [None][chore] Bump version to 1.3.0rc2 by @yiqingy0 in #11021
- [None][chore] Remove closed bugs by @xinhe-nv in #10982
- [#10889][fix] fix pydantic deepcopy bug by @reasonsolo in #11004
- [TRTLLM-9390][chore] Add Fake OPs for One-Sided AlltoAll. by @bobboli in #11002
- [TRTLLM-9831][perf] Use TMA.RED to improve effective memory bandwidth by @sherry-1001 in #10987
- [TRTLLM-9527][feat] change context params and disagg params (step3) by @chuangz0 in #10495
- [TRTLLM-10308][feat] AutoTuner Cache: reorganize cache file for distributed tuning by @hyukn in #10956
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10993
- [https://nvbugs/5843316][chore] waive overlap_scheduler test by @galagam in #11025
- [#10013][feat] AutoDeploy: native cache manager integration by @lucaslie in #10635
- [https://nvbugs/5721661][chore] Unwaive fixed bug. by @SimengLiu-nv in #11009
- [#10877][fix] restore ipv6 support in serve.py by @Evgueni-Petrov-aka-espetrov in #10929
- [TRTLLM-10197][chore] Refactor to setup for RNN cache transceiver by @NVShreyas in #10957
- [TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform by @mlefeb01 in #11006
- [None][fix] Proper conditional compilation of sm10x cubins by @tongyuantongyu in #10839
- [https://nvbugs/5756804][fix] Re-enable passing test by @dongfengy in #10986
- [None][fix] unwaive tests by @xinhe-nv in #11047
- [https://nvbugs/5779536][fix] Cherry-pick #10902: Unwaive DeepSeekR1 nvfp4 pp4 mtp test case (#10902) by @pengbowang-nv in #11000
- [None][infra] Update CI allowList by @yuanjingx87 in #11040
- [TRTLLM-10362][feat] Added Mamba and MLA layers to the sharding tests by @greg-kwasniewski1 in #10364
- [None][chore] Removing cpp/tensorrt_llm/pybind by @Linda-Stadter in #11026
- [None][feat] support multi_acc and Lyris GB200 test by @yingguo-trt in #11024
- [None][infra] Waive failed cases for main on 1/28 by @EmmaQiaoCh in #11053
- [None][chore] AutoDeploy: Eagle One-Model [1/n]: PyTorch impl for Eagle3 checkpoint by @govind-ramnarayan in #10674
- [#10245][feat] AutoDeploy: Add Minimax M2 support by @bmarimuthu-nv in #10525
- [None][fix] nccl symmetric with graceful fallbacks by @nv-lschneider in #11042
- [None][fix] fix Qwen2/3 export for AutoDeploy by @Fridah-nv in #11007
- [None][fix] No need to remove the original waive list by @yiqingy0 in #11060
- [https://nvbugs/5761391][fix] Include triton-kernels as a packaged dependency by @anish-shanbhag in #10471
- [None][fix] Fix enable_alltoall passed to CutlassFusedMoE by @syuoni in #11016
- [None][feat] Add performance alignment to layer-wise benchmarks by @yuantailing in #11018
- [https://nvbugs/5813452][fix] Fix "Assertion failed: isLeaf() in kvCacheManager.cpp:465" by @Boreas618 in #10922
- [None][infra] Waived flaky tests by @ZhanruiSunCh in #11091
- [TRTLLM-10264][feat] Support attention DP + Helix CP by @brb-nv in #10477
- [TRTLLM-10415][feat] Dump thread stacks for hanging tests before time… by @WeiHaocheng in #10708
- [TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler by @stnie in https:/...
v1.3.0rc1
Highlights

Model Support

API

Feature
- Update disagg slurm scripts (#10712)
- Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
- Fix sharding dashboard errors (#10786)
- Async Transfer Manager (#9891)
- Speculative One Model: FlashInfer sampling (#10284)
- Refactor speculative decoding workers (#10768)
- Use global unique id as disagg request id (#10187)
- Enable guided decoding with reasoning parsers (#10890)
- Support partial update weight for fp8 (#10456)
- Multi-LoRA serving with CUDA Graph (#8279)
- Support logprobs for Completions API (#10809)
- Eagle3 Specdec UX improvements (#10124)
- Python transceiver components (step 2) (#10494)
- Upgrade NIXL to v0.9.0 (#10896)
- KV Connector Support for MTP (#10932)
- Support overlap scheduler for disagg ctx instances (#10755)
- Adding implementation of KVCacheManagerV2 (#10736)
- Switch to ConfigurableMoE as the default path (#10792)
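The globally unique disagg request id item above (#10187) replaces per-instance ids, which can collide when multiple workers generate them independently. A hedged sketch of one common scheme for globally unique ids; the hostname/pid/UUID composition is an assumption for illustration, not the actual TensorRT-LLM id format.

```python
import os
import socket
import uuid

# Hypothetical sketch of a globally unique request id for disaggregated
# serving: hostname + pid + UUID, so ids cannot collide across processes
# or containers even when hostnames repeat. Not the actual TRT-LLM scheme.
def make_request_id() -> str:
    return f"{socket.gethostname()}-{os.getpid()}-{uuid.uuid4().hex}"

a, b = make_request_id(), make_request_id()
print(a != b)  # True
```

The UUID component alone already makes collisions vanishingly unlikely; the host and pid prefixes mainly help when tracing a request back to its originating worker in logs.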
Fix
- Enable system memory to transfer active message in NIXL ucx (#10602)
- Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A (#10539)
- Default disable gemm+allreduce fusion (#10656)
- Fix vulnerability urllib3 and nbconvert (#10551)
- Fix overlap scheduler race condition (#10610)
- Replace pickle.load with restricted Unpickler (#10622)
- Fix copy start_logs in disagg slurm scripts (#10840)
- Cherry-pick: Disable short profile for tunable ops with MERGE strategy (#10844, #10715)
- Lock resource to fix potential access to released data (#10827)
- Cherry-pick: Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841, #10654)
- Remove weight tensor holder to release memory earlier (#10876)
- Add missing dist strategy param and fix typo for ad_logger (#10892)
- Update RMSNorm custom op plumbing (#10843)
- Fix hmac launch (#10434)
- Avoid double update for previous batch (#9888)
- Re-init TRTLLM sampler to use sample stream in multi-stream cases (#10918)
- Fix MTP with async scheduler (#10941)
- Fix buffer reuse (#10716)
- Cherry-pick: Fix hanging issue for MNNVL Allreduce under PP (#10750, #10633)
- Workaround for flashinfer.sampling.sampling_from_logits (#10713)
- Fix port 8000 being used issue in stress test (#10756)
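The fix "Replace pickle.load with restricted Unpickler (#10622)" follows a standard hardening pattern. A minimal sketch of that pattern is below; the class and the allow-list contents are illustrative assumptions, not the actual TensorRT-LLM implementation:

```python
import io
import pickle

# Only explicitly allow-listed globals may be resolved during unpickling,
# so a payload that tries to resolve an arbitrary callable is rejected
# instead of executed. (Hypothetical allow-list for illustration.)
ALLOWED = {
    ("builtins", "list"),
    ("builtins", "dict"),
    ("builtins", "set"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def restricted_loads(data: bytes):
    # Drop-in replacement for pickle.loads on untrusted bytes.
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers of primitives round-trip unchanged, while any pickle that references a global outside the allow-list raises `UnpicklingError` at load time.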
-
Documentation
-
Test & Infra
- Upload regression info to artifactory (#10599)
- Add sonarqube scanning in lockfile generation pipeline (#10700)
- Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
- Remove trt flow tests in NIM (#10731)
- Update config.yaml of slurm scripts to align with submit.py change (#10802)
- Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
- Trigger multi-gpu tests when install_nixl/ucx.sh is modified (#10624)
- Add DGX-Spark VLM accuracy and perf spec dec cases (#10804)
- Fix test list llm_spark_func.txt (#10921)
- Add test configurable moe module multi gpu (#10699)
- NVFP4 MoE - Move weights transformation to fusion phase (#10803)
- Update flashinfer-python to 0.6.1 (#10872)
- Improve disagg acc tests (#10833)
- Refine placement group in ray executor (#10235)
- Regenerate outdated lock file (#10940)
- Remove long-running sanity check tests on GH200 (#10924, #10969)
- Add dgx-spark beta notes (#10766)
- Modify ctx config in 128k8k disagg cases (#10779)
- Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
What's Changed
- [#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 by @MrGeva in #10697
- [None][infra] Add sonarqube scanning in lockfile generation pipeline by @yuanjingx87 in #10700
- [https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test by @lucaslie in #10461
- [#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size by @MrGeva in #10687
- [#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] by @lucaslie in #10675
- [https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx by @chuangz0 in #10602
- [https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests by @lucaslie in #10769
- [TRTLLM-10300][feat] Upload regression info to artifactory by @chenfeiz0326 in #10599
- [None][chore] Add release/1.2 branch into lockfile generation schedule by @yiqingy0 in #10790
- [TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab by @ZhanruiSunCh in #10616
- [None][infra] Waive failed cases for main on 01/19 by @EmmaQiaoCh in #10794
- [#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test by @MrGeva in #10603
- [None][feat] Update disagg slurm scripts by @qiaoxj07 in #10712
- [None][test] adjust the dis-agg test timeout threshold by @Shixiaowei02 in #10800
- [None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) by @ssam18 in #10320
- [None][chore] Remove trt flow tests in NIM by @jieli-matrix in #10731
- [None][chore] update config.yaml of slurm scripts to align with submit.py change by @dc3671 in #10802
- [https://nvbugs/5776445][chore] unwaive test by @reasonsolo in #10667
- [TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python by @lancelly in #10273
- [TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. by @bobboli in #10539
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10776
- [None][fix] default disable gemm+allreduce fusion by @benzh-2025 in #10656
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10787
- [None][fix] Fix vulnerability urllib3 and nbconvert by @yiqingy0 in #10551
- [None][test] Update sanity test list by @xinhe-nv in #10825
- [None][fix] Remove unused params in attn by @yizhang-nv in #10652
- [TRTLLM-10785][feat] Fix sharding dashboard errors by @greg-kwasniewski1 in #10786
- [https://nvbugs/5701445][chore] unwaive test. by @yuxianq in #10806
- [None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is mod… by @bo-nv in #10624
- [None][infra] Waive failed cases for main branch on 01/20 by @EmmaQiaoCh in #10829
- [None][chore] Reduce tedious logs by @chzblych in #10847
- [#10707][fix] AutoDeploy: Super accuracy test fixes by @galagam in #10717
- [None][chore] Async Transfer Manager by @jthomson04 in #9891
- [None][fix] fix duplicate entry in waives.txt by @lucaslie in #10853
- [None][feat] Speculative One Model: FlashInfer sampling by @IzzyPutterman in #10284
- [https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… by @SimengLiu-nv in #10610
- [https://nvbugs/5760737][test] only skip mooncake+indexerkcache test by @zhengd-nv in #10266
- [https://nvbugs/5759698][fix] unwaive test_base_worker by @Superjomn in #10669
- [None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes by @djns99 in #9532
- [https://nvbugs/5670458][chore] Unwaive reward model test by @shuyixiong in #10831
- [None][chore] Revert #10847 by @chzblych in #10869
- [https://nvbugs/5775021] [fix] Replace pickle.load with restricted Unpickler by @yibinl-nvidia in #10622
- [None][fix] Fix copy start_logs in disagg slurm scripts by @qiaoxj07 in #10840
- [None][fix] Cherry-pick #10715: Disable short profile for tunable ops with MERGE strategy by @hyukn in #10844
- [https://nvbugs/5740377][fix] Lock resource to fix potential access to released data by @HuiGao-NV in #10827
- [https://nvbugs/5814253][fix] unwaive test_autotuner_di...
v1.2.0rc6.post2
What's Changed
- [None][fix] enable EPLB for DEEPGEMM by @xxi-nv in #10618
- [https://nvbugs/5811697][fix] Fix buffer reuse for release/1.2.0rc6.post1 by @yuxianq in #10734
- [None][fix] impl fused triton kernel for e8m0 resmooth (target release/1.2.0rc6.post1, cherry-pick from #10327 and #10770) by @yuxianq in #10771
- [None][chore] Bump version to 1.2.0rc6.post2 by @yiqingy0 in #10907
Full Changelog: v1.2.0rc6.post1...v1.2.0rc6.post2
v1.3.0rc0
Highlights
-
Model Support
-
API Improvements
- Added processed logprobs functionality to TorchSampler (#9675)
- Added support for image_embeds in OpenAI API (#9715)
- Covered LLM API `multi_modal_embeddings` (#9963)
- Implemented GET/DELETE v1/responses/{response_id} endpoints (#9937)
- Use RequestError for validation errors to prevent engine shutdown (#9761)
-
Performance Optimizations
- Added Hopper XQA decode support for skip softmax attention (#10264)
- Enabled attention data parallelism for Nemotron Super v3 (#10347)
- Added fp4 GEMM with AllReduce support (#9729)
- Use XQA JIT implementation by default with sliding window perf optimization (#10335)
- Reduced host overhead for unified nvfp4 GEMM tuning path (#10503)
- Implemented fused Triton kernel for e8m0 resmooth to reduce memory footprint (#10327)
-
MoE (Mixture of Experts) Enhancements
- Added ExpertStatistic and DUMMY_ALLREDUCE for configurable MoE (#10401)
- Added test configurable MoE module (#10575)
- Implemented padding empty chunk for configurable MoE (#10451)
- Enabled EPLB for DEEPGEMM (#10617)
- Extended MoE quantization test utilities with comprehensive quant algorithm support (#10691)
-
Disaggregation Features
-
Auto Deploy
-
Fixes
- Fixed PP loop hang caused by isend-ing new requests (#10665)
- Avoided write-write race for async PP send (#10488)
- Fixed hang issue when enabling skip softmax on Blackwell (#10490)
- Fixed hanging issue for MNNVL Allreduce under PP (#10633)
- Implemented PP skip forward for all spec workers (#10578)
- Added warning for gen-only paused state (#10664)
- Used uint64_t as dtype of lamport_buffer_size to avoid overflow (#10499)
- Fixed HelixCpMnnvlMemory initialization with PP (#10533)
- Fixed regression in KV cache resize memory estimation (#10726)
- Prevented out-of-bounds read (#9879)
- Solved pillow version conflict (#10537)
- Support parsing the `modules_to_not_convert` keyword of the HF model config (#10527)
- Used correct model names for config database regression tests (#10192)
- Support GuidedDecoder with sharded logits (#10698)
- Fixed Piecewise CUDA Graph for GPTOSS (#10631)
- Fixed AutoDeploy EP sharding test (#10460)
- Fixed the nvfp4 fused_moe in AutoDeploy (#10727)
- Added quantization check for DeepEP LL low precision combine in new MoE comm API (#10072)
- Fixed AIPerf issue (#10666)
- Disabled TinyGEMM PDL due to accuracy issues (WAR) (#10619)
- Only keep a limited amount of performance statistics data (#10569)
- Convert to CUDA tensor before calling _resmooth_kernel (#10770)
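Among the fixes above, #10499 widens `lamport_buffer_size` to `uint64_t` because buffer-size arithmetic near 4 GiB wraps in 32 bits. Python integers do not overflow, so the sketch below emulates C-style fixed-width arithmetic with masks to show the failure mode (function and constant names are illustrative, not the actual kernel code):

```python
# Masks emulating C's uint32_t and uint64_t wrap-around semantics.
U32 = (1 << 32) - 1
U64 = (1 << 64) - 1

def buffer_bytes(num_elems: int, elem_size: int, mask: int) -> int:
    # With a 32-bit mask, 2**30 elements of 8 bytes (8 GiB) wraps to 0;
    # with a 64-bit mask the true size is preserved.
    return (num_elems * elem_size) & mask
```

A size of `(1 << 30) * 8` bytes is exactly `2**33`, which is a multiple of `2**32`, so the 32-bit result is 0 — a silently undersized buffer — while the 64-bit result is correct.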
-
Test & Infra
- Added hang detection for executor loop and worker (#10480)
- Implemented bot to send performance regression messages to Slack channel (#10489)
- Made model initialization more general and support weights loading in layer-wise benchmarks (#10562)
- Updated trtllm-gen to support groupsTokensHeadsQ (#10261)
- Added support to export data in trtllm-eval (#10075)
- Added Torch extension API for FusedAddRMSNormQuant kernel (#9905)
- Enabled ray tests (#10272)
- Prevented flaky failures in C++ test_e2e.py by using local cached datasets (#10638)
- Enabled partial reuse in Gemma and GPT OSS test (#10559)
What's Changed
- [TRTLLM-10195][feat] K-EXAONE support by @yechank-nvidia in #10355
- [None][test] update core test list by @crazydemo in #10538
- [#8391][chore] removed llama and added deepseek to AutoDeploy's L0 perf test by @MrGeva in #10585
- [TRTLLM-10022][feat] Add hopper xqa decode support for skip softmax attention by @pengbowang-nv in #10264
- [None][chore] update waive list by @jieli-matrix in #10577
- [None][feat] Add ExpertStatistic and DUMMY_ALLREDUCE for configurable_moe by @qiaoxj07 in #10401
- [TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel by @chenfeiz0326 in #10489
- [None][chore] update deepseekv3.2 test parameter by @yingguo-trt in #10595
- [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10572
- [https://nvbugs/5794796][chore] waive test blocking premerge by @dc3671 in #10593
- [None][fix] Solve pillow version conflict by @Wanli-Jiang in #10537
- [TRTLLM-9522][test] cover LLM API `multi_modal_embeddings` by @ixlmar in #9963
- [None][infra] Waive failed tests for main 01/12 by @EmmaQiaoCh in #10604
- [#10580][fix] re-enable NemotronH MOE MMLU test by @suyoggupta in #10594
- [https://nvbugs/5761391][fix] Use correct model names for config database regression tests by @anish-shanbhag in #10192
- [None][chore] Print correct backend name in benchmark report by @galagam in #10597
- [https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg by @Tabrizian in #10111
- [https://nvbugs/5762336][fix] support to parse the keyword modules_to_not_convert of the HF model config by @xxi-nv in #10527
- [None][chore] Fix disagg assert by @fredricz-20070104 in #10596
- [TRTLLM-10271][test] Add Spark QA functional and performance cases by @JennyLiu-nv in #10564
- [None][infra] try removing shared cache dir mount by @tburt-nv in #10609
- [None][infra] Update allowlist 2026.01.08 by @niukuo in #10535
- [None][feat] Hang detection for executor loop and worker. by @yuxianq in #10480
- [TRTLLM-8462][feat] Support GET/DELETE v1/responses/{response_id} by @JunyiXu-nv in #9937
- [TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. by @nv-guomingz in #10347
- [https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow by @yilin-void in #10499
- [NVBUG-5670458][chore] Unwaive lp tests by @hchings in #10524
- [TRTLLM-8425][doc] document Torch Sampler details by @ixlmar in #10606
- [None][feat] Layer-wise benchmarks: make model init more general and support weights loading by @yuantailing in #10562
- [None][test] Unwaive qwen3 next test case. by @nv-guomingz in #9877
- [None][feat] add fp4 gemm + allreduce by @benzh-2025 in #9729
- [None][infra] support overriding nspect version by @niukuo in #10402
- [https://nvbugs/5772396][fix] WAR: Disable TinyGEMM PDL due to accuracy issues by @dongfengy in #10619
- [None][feat] AutoDeploy: refactor memory usage logging by @nzmora-nvidia in #8505
- [#9283][feat] AutoDeploy: separate rms pattern detection from fusion by @Fridah-nv in #9969
- [https://nvbugs/5791900][fix] Fix HelixCpMnnvlMemory init with PP by @brb-nv in #10533
- [None][chore] Add test configurable moe module by @leslie-fang25 in #10575
- [https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. by @yuxianq in #10578
- [None][fix] Avoid write-write race for async pp send. by @yuxianq in #10488
- [https://nvbugs/5753788][chore] Padding empty chunk for configurable moe by @leslie-fang25 in #10451
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10589
- [None][chore] update allowlist 2026-01-13 by @tburt-nv in #10645
- [None][test] add test into qa test list by @xinhe-nv in #10627
- [None][test] Spark - Change testlist name and perf yml format by @JennyLiu-nv in #10626
- [None][chore] waive the CI failure by @xxi-nv in #10655
- [None][refactor] Unify the usage of MPIDist and TorchDist. by @yuxianq in #10380
- [None][fix] Reduce host over...
v1.2.0rc8
Highlights
-
Model Support
- Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
- Eagle: qwen2 capture hidden states (#10091)
- Add pp support for DeepSeek-v3.2 (#10449)
- Pass lora_params through Qwen2/3 model forward (#10174)
- Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
- Minor code refinements for Mistral Large 3 (#10405)
- EPD for Qwen3 VL (#10470)
- Remove some model support; add device constraint (#10563)
- Enable AttentionDP on Qwen3-VL and fix test (#10435)
-
API
- Add stability tags for serve subcommand (#10012)
-
Feature
- Better align MLA chunking with indexer chunking when chunked prefill is enabled for DSV32 (#10552)
- Sm100 weight-only kernel (#10190)
- AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
- Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
- Add transferAgent binding (step 1) (#10113)
- Add the EOS tokens from the generation config to the sampler's stop words (#10389)
- Apply fusion for W4AFP8_AWQ MoE (#9838)
- Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
- Run sample_async on extra stream (#10215)
- Optimize qk rope/nope concat for DSA (#10571)
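Folding the generation config's EOS token ids into the sampler's stop words, as described for #10389, amounts to a de-duplicating merge that preserves the user's own stop ids. The helper below is a hypothetical sketch, not the TensorRT-LLM API:

```python
def merge_eos_into_stop_words(stop_ids, eos_token_ids):
    # Keep user-provided stop ids in order, then append any EOS ids
    # from the generation config that are not already present.
    merged = list(stop_ids)
    seen = set(stop_ids)
    for eos in eos_token_ids:
        if eos not in seen:
            merged.append(eos)
            seen.add(eos)
    return merged
```

With this merge, generation stops on the model's EOS tokens even when the caller supplied only custom stop sequences.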
-
Fix
- Fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394)
- Use port 0 as an arbitrary port when disagg service discovery is enabled (#10383)
- Fix buffer reuse for CUDA graph attention metadata (#10393)
- Force release torch memory when LLM is destroyed (#10314)
- Swap TP-CP grouping order (#10350)
- TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
- Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
- Fixed recursive node traversals (#10379)
- Fix undefined tokens_per_block (#10438)
- Skip spec dec for non-last rank (#10445)
- Set up dist before using autotuner (#10491)
- Fix broken cast (#9975)
- Fix sm120 speculation (#10049)
- Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873)
- Choose register model config over root config for VLM (#10553)
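The "port 0" technique behind #10383 (and the random-port test fixes below) relies on the OS assigning a free ephemeral port when a socket binds to port 0, which avoids collisions and retry loops. A minimal sketch, with a hypothetical helper name:

```python
import socket

def pick_free_port(host: str = "127.0.0.1") -> int:
    # Binding to port 0 asks the kernel for any free ephemeral port;
    # getsockname() reveals which one was assigned.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
```

Note there is still a small race between closing this probe socket and the real server binding the port, which is why service-discovery setups prefer to bind port 0 directly on the serving socket.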
-
Documentation
- Update SWA + spec dec support matrix (#10421)
- Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
- Adding parallelism types in feature combination matrix (#9849)
- Update GPTOSS Doc (#10536)
- Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
- Update Qwen3-Next doc by adding known issues section (#10582)
-
Test & Infra
- Add tests for DeepSeek v3.2 (#10561)
- Add accuracy tests for super-v3 with multiple-gpus (#10234)
- Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
- Add disag-serving kimi k2 thinking tests (#10357)
- Partition test_llm_pytorch.py for parallel execution (#10400)
- Only use throughput metrics to check regression (#10404)
- Add vswa test cases coverage (#10146)
- Use random port in container port section (#10432)
- Remove redundant retries while binding to arbitrary port (#10452)
- Add qwen3-4b accuracy test case (#10382)
- Update kimi-k2-1k1k dataset (#10473)
- Fix concurrency list in Wide-EP perf tests (#10529)
- Restrict max_num_tokens in disagg mtp config (#10442)
- Add kimi_k2 single node perf test (#10436)
- Add MMMU test for mistral small (#10530)
- Workaround OCI-NRT slowdown issue (#10587)
What's Changed
- [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
- [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
- [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
- [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
- [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
- [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
- [None][feat] sm100 weight-only kernel by @Njuapp in #10190
- [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
- [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
- [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
- [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
- [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
- [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
- [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
- [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
- [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
- [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
- [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
- [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
- [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
- [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
- [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
- [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
- [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
- [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
- [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
- [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
- [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
- [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
- [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
- [None][feat] precompiled installation from local src dir by @lucaslie in #10419
- [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
- [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
- [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389
- [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
- [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
- [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
- [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
- [None][docs] Add `--config` preference over `--extra_llm_api_options` in CODING_GUIDELINES.md by @venkywonka in #10426
- [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
- [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
- [None][test] update test case constraint by @crazydemo in #10381
- [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
- [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
- [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
- [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
- [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz...