
[TRTLLM-9771][feat] Allow overriding quantization configs #11062

Merged
shuyixiong merged 3 commits into NVIDIA:main from shuyixiong:user/shuyix/support_fp8_refit
Jan 31, 2026

Conversation

@shuyixiong
Collaborator

@shuyixiong shuyixiong commented Jan 28, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Removed model_kwargs key filters, enabling quantization configuration overrides via model_kwargs.
  • Test Improvements

    • Updated tests to pass num_hidden_layers via model_kwargs.
    • Added weight-update support for FP8 refit.
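The release notes above describe the override behavior at a high level; the sketch below illustrates the precedence rule with plain dicts. It is not TRT-LLM source: `resolve_quant_config` is a hypothetical helper, though the `model_kwargs` and `quantization_config` names come from this PR.

```python
# Illustrative sketch (not TRT-LLM code): a "quantization_config" entry in
# model_kwargs takes precedence over the checkpoint's own quant config.
def resolve_quant_config(hf_quant_config, model_kwargs):
    """Return the effective quantization config: model_kwargs wins if present."""
    override = (model_kwargs or {}).get("quantization_config")
    if override is not None:
        return override
    return hf_quant_config

checkpoint_cfg = {"quant_method": "fp8", "activation_scheme": "dynamic"}
override_kwargs = {"quantization_config": {"quant_method": "fp8",
                                           "weight_block_size": [128, 128]}}

effective = resolve_quant_config(checkpoint_cfg, override_kwargs)
print(effective["weight_block_size"])  # [128, 128]
```

When no override is supplied, the checkpoint's config is returned unchanged, so existing callers are unaffected.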


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong shuyixiong requested review from a team as code owners January 28, 2026 08:53
@coderabbitai
Contributor

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

The changes introduce runtime model configuration through a new model_kwargs parameter in the LLM constructor, refactor quantization handling logic to consolidate multiple strategies with override support, and simplify test patterns by removing temporary directory copy workflows.

Changes

  • Config and Quantization Handling (tensorrt_llm/_torch/model_config.py): Replaced the strict key-existence check with a permissive getattr() to allow setting new attributes on config objects from model_kwargs instead of silently ignoring missing keys.
  • Quantization Processing (tensorrt_llm/llmapi/llm_utils.py): Consolidated quantization_config handling in _update_from_hf_quant_config with override support from model_kwargs. Added unified logic paths for FP8 block scales, MXFP4, and compressed-tensors methods with validation and compatibility checks (e.g., channel vs. block strategy requirements). Returns early only after a valid quant_config is applied.
  • Test Refactoring (tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py): Removed TemporaryDirectory workflows and filesystem copy logic. Updated LLM constructor usage to accept the model_kwargs parameter for runtime per-model configuration (e.g., num_hidden_layers, quantization_config). Added in-test weight filtering logic for selective IPC handle collection during update_weights RPC calls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 22.22%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ⚠️ Warning: The PR description is incomplete; it contains only the template structure, with no content in the Description and Test Coverage sections. Resolution: fill in the Description section explaining what changes enable quantization config overrides and why, and complete the Test Coverage section listing the relevant tests (e.g., test_llm_update_weights_with_quant_config).

✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title clearly and specifically describes the main feature (allowing quantization configs to be overridden), which aligns with the commit message and source branch intent.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@tensorrt_llm/llmapi/llm_utils.py`:
- Around lines 410-418: The MXFP4 branch sets quant_config.exclude_modules directly but does not merge the HF-provided modules_to_not_convert entries the way ModelConfig.load_hf_quant_config does. After setting the default exclude_modules, read hf_quant_config.get("modules_to_not_convert") and merge those entries into quant_config.exclude_modules without duplicates; also set quant_config.modules_to_not_convert if that attribute is used elsewhere. Keep the ModelConfig.get_mxfp4_quant_algo(...) call and the group_size assignment unchanged; only add the merge step.
- Around lines 394-398: The f-string in the logger.info call, inside the block that checks self.llm_args.model_kwargs for "quantization_config", nests double quotes inside a double-quoted f-string, which breaks on Python < 3.12. Extract the value into a temporary variable (e.g., quant_cfg = self.llm_args.model_kwargs["quantization_config"]) or use single quotes for the dict key, then log with logger.info(f"Update hf_quant_config from model_kwargs: quantization_config={quant_cfg} (previous value: {hf_quant_config})") and assign hf_quant_config = quant_cfg.
- Around lines 404-418: The FP8 branch sets quant_algo and exclude_modules but omits the group_size assignment, block_size validation, and modules_to_not_convert merge performed in ModelConfig.load_hf_quant_config. Where QuantAlgo.FP8_BLOCK_SCALES is set, also set quant_config.group_size = hf_quant_config["block_size"][0], validate that tuple(hf_quant_config.get("block_size", ())) == (128, 128) (raising otherwise), set quant_config.block_size accordingly, and merge quant_config.modules_to_not_convert with the defaults from ModelConfig.load_hf_quant_config so the TRT-LLM and PyTorch paths behave consistently.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/model_config.py (1)

530-557: Permissive attribute setting removes validation for typos in model_kwargs keys.

The change from checking key existence to using getattr(config, key, None) allows setting arbitrary new attributes on the config. While this is necessary to support quantization_config overrides, it means typos in model_kwargs keys will silently create new attributes rather than warning the user.

Consider adding a warning when setting a key that doesn't exist on the config (while still allowing it):

💡 Suggested improvement
 for key, value_new in update_dict.items():
     target_value = getattr(config, key, None)
+    if target_value is None and not hasattr(config, key):
+        logger.warning(
+            f"model_kwargs key '{key}' does not exist on config, creating new attribute"
+        )

     # Handle nested PretrainedConfig objects when value is a dict
tensorrt_llm/llmapi/llm_utils.py (1)

400-470: Code duplication with ModelConfig.load_hf_quant_config.

The quantization config handling logic here duplicates ModelConfig.load_hf_quant_config in tensorrt_llm/_torch/model_config.py (lines 315-407 in the provided context). Both handle the same quant methods (fp8, mxfp4, compressed-tensors) with similar but slightly different logic.

Consider refactoring to call ModelConfig.load_hf_quant_config directly or extracting a shared utility function to avoid divergence and maintenance burden.
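The duplication noted above could be reduced by routing both call sites through one normalization helper. The sketch below is hypothetical: the algo names and dict keys are illustrative stand-ins, not TRT-LLM's actual API.

```python
from typing import Optional, Tuple

# Hypothetical shared helper: one place maps an HF quantization_config dict
# to a normalized (algo_name, group_size) pair, so llm_utils.py and
# model_config.py cannot drift apart. Names here are stand-ins.
def normalize_hf_quant_config(hf_quant_config: dict) -> Tuple[str, Optional[int]]:
    method = hf_quant_config.get("quant_method")
    if method == "fp8":
        block = hf_quant_config.get("weight_block_size")
        if block:
            return "FP8_BLOCK_SCALES", block[0]
        return "FP8", None
    if method == "mxfp4":
        return "MXFP4", 32  # 32 is the usual MX block size, assumed here
    raise ValueError(f"Unsupported quant_method: {method!r}")

print(normalize_hf_quant_config({"quant_method": "fp8",
                                 "weight_block_size": [128, 128]}))
# ('FP8_BLOCK_SCALES', 128)
```

Both code paths would then only differ in how they apply the normalized result, not in how they interpret the HF config.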

tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py (1)

139-147: Consider adding explicit LLM cleanup to release GPU resources.

The LLM instances are created but not explicitly cleaned up. While pytest will eventually garbage collect them, explicit cleanup could help with GPU memory management, especially when running multiple tests.

💡 Optional improvement
# At the end of each test, add:
llm.shutdown()  # or appropriate cleanup method if available

Alternatively, use a pytest fixture with yield for automatic cleanup.

Also applies to: 188-196, 250-266
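The cleanup suggested above uses the yield pattern; the same idea can be shown outside pytest with a context manager. A minimal sketch, assuming a hypothetical FakeLLM whose shutdown() method stands in for whatever cleanup API the real LLM class exposes:

```python
from contextlib import contextmanager

class FakeLLM:
    """Stand-in for the real LLM class; shutdown() here is hypothetical."""
    def __init__(self):
        self.closed = False
    def shutdown(self):
        self.closed = True

@contextmanager
def managed_llm():
    llm = FakeLLM()
    try:
        yield llm          # the test body runs here
    finally:
        llm.shutdown()     # cleanup runs even if the body raises

with managed_llm() as llm:
    assert not llm.closed  # resource is live inside the block
    last = llm
print(last.closed)  # True: shutdown ran on exit
```

In a pytest suite, the same body becomes a fixture: decorate a generator with @pytest.fixture, yield the instance, and call the cleanup method after the yield.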

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@Superjomn Superjomn requested a review from syuoni January 28, 2026 09:12
@tensorrt-cicd
Collaborator

PR_Github #33860 [ run ] triggered by Bot. Commit: d198ca4

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33872 [ run ] triggered by Bot. Commit: ca1c20f

@tensorrt-cicd
Collaborator

PR_Github #33872 [ run ] completed with state SUCCESS. Commit: ca1c20f
/LLM/main/L0_MergeRequest_PR pipeline #26120 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.
Pipeline has performance regression cases. Check the performance regression report for details.

Collaborator

@syuoni syuoni left a comment


Left some minor comments, thanks!

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33975 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #33975 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26208 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@shuyixiong shuyixiong enabled auto-merge (squash) January 29, 2026 09:51
@tensorrt-cicd
Collaborator

PR_Github #34042 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34042 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26263 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34053 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34053 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26272 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34195 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34195 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26385 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34306 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34306 [ run ] completed with state FAILURE. Commit: f0d9118
LLM/main/L0_MergeRequest_PR #26458 (Blue Ocean) completed with status: ABORTED

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34315 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34315 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26467 completed with status: 'SUCCESS'

@shuyixiong shuyixiong merged commit 278ced9 into NVIDIA:main Jan 31, 2026
5 checks passed