
[TRTLLM-9771][feat] Allow overriding quantization configs #11062

Merged
shuyixiong merged 3 commits into NVIDIA:main from shuyixiong:user/shuyix/support_fp8_refit
Jan 31, 2026

Conversation

@shuyixiong
Collaborator

@shuyixiong shuyixiong commented Jan 28, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Removed model_kwargs key filters, enabling quantization configuration overrides via model_kwargs.
  • Test Improvements

    • Updated tests to pass num_hidden_layers via model_kwargs.
    • Added weight-update support for FP8 refit.
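The release notes above describe the override behavior at a high level; the sketch below illustrates the precedence rule with plain dicts. It is not TRT-LLM source: `resolve_quant_config` is a hypothetical helper, though the `model_kwargs` and `quantization_config` names come from this PR.

```python
# Illustrative sketch (not TRT-LLM code): a "quantization_config" entry in
# model_kwargs takes precedence over the checkpoint's own quant config.
def resolve_quant_config(hf_quant_config, model_kwargs):
    """Return the effective quantization config: model_kwargs wins if present."""
    override = (model_kwargs or {}).get("quantization_config")
    if override is not None:
        return override
    return hf_quant_config

checkpoint_cfg = {"quant_method": "fp8", "activation_scheme": "dynamic"}
override_kwargs = {"quantization_config": {"quant_method": "fp8",
                                           "weight_block_size": [128, 128]}}

effective = resolve_quant_config(checkpoint_cfg, override_kwargs)
print(effective["weight_block_size"])  # [128, 128]
```

When no override is supplied, the checkpoint's config is returned unchanged, so existing callers are unaffected.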


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong shuyixiong requested review from a team as code owners January 28, 2026 08:53
@coderabbitai
Contributor

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

The changes introduce runtime model configuration through a new model_kwargs parameter in the LLM constructor, refactor quantization handling logic to consolidate multiple strategies with override support, and simplify test patterns by removing temporary directory copy workflows.

Changes

  • Config and Quantization Handling (tensorrt_llm/_torch/model_config.py): Replaced the strict key-existence check with a permissive getattr() to allow setting new attributes on config objects from model_kwargs instead of silently ignoring missing keys.
  • Quantization Processing (tensorrt_llm/llmapi/llm_utils.py): Consolidated quantization_config handling in _update_from_hf_quant_config with override support from model_kwargs. Added unified logic paths for FP8 block scales, MXFP4, and compressed-tensors methods with validation and compatibility checks (e.g., channel vs. block strategy requirements). Returns early only after a valid quant_config is applied.
  • Test Refactoring (tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py): Removed TemporaryDirectory workflows and filesystem copy logic. Updated LLM constructor usage to accept the model_kwargs parameter for runtime per-model configuration (e.g., num_hidden_layers, quantization_config). Added in-test weight filtering logic for selective IPC handle collection during update_weights RPC calls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 22.22%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ⚠️ Warning: The PR description is incomplete; it contains only the template structure, with no content in the Description and Test Coverage sections. Resolution: fill in the Description section explaining what changes enable quantization config overrides and why, and complete the Test Coverage section listing the relevant tests (e.g., test_llm_update_weights_with_quant_config).

✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title clearly and specifically describes the main feature (allowing quantization configs to be overridden), which aligns with the commit message and source branch intent.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@tensorrt_llm/llmapi/llm_utils.py`:
- Around lines 410-418: The MXFP4 branch sets quant_config.exclude_modules directly but does not merge the HF-provided modules_to_not_convert entries the way ModelConfig.load_hf_quant_config does. After setting the default exclude_modules, read hf_quant_config.get("modules_to_not_convert") and merge those entries into quant_config.exclude_modules without duplicates; also set quant_config.modules_to_not_convert if that attribute is used elsewhere. Keep the ModelConfig.get_mxfp4_quant_algo(...) call and the group_size assignment unchanged; only add the merge step.
- Around lines 394-398: The f-string in the logger.info call, inside the block that checks self.llm_args.model_kwargs for "quantization_config", nests double quotes inside a double-quoted f-string, which breaks on Python < 3.12. Extract the value into a temporary variable (e.g., quant_cfg = self.llm_args.model_kwargs["quantization_config"]) or use single quotes for the dict key, then log with logger.info(f"Update hf_quant_config from model_kwargs: quantization_config={quant_cfg} (previous value: {hf_quant_config})") and assign hf_quant_config = quant_cfg.
- Around lines 404-418: The FP8 branch sets quant_algo and exclude_modules but omits the group_size assignment, block_size validation, and modules_to_not_convert merge performed in ModelConfig.load_hf_quant_config. Where QuantAlgo.FP8_BLOCK_SCALES is set, also set quant_config.group_size = hf_quant_config["block_size"][0], validate that tuple(hf_quant_config.get("block_size", ())) == (128, 128) (raising otherwise), set quant_config.block_size accordingly, and merge quant_config.modules_to_not_convert with the defaults from ModelConfig.load_hf_quant_config so the TRT-LLM and PyTorch paths behave consistently.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/model_config.py (1)

530-557: Permissive attribute setting removes validation for typos in model_kwargs keys.

The change from checking key existence to using getattr(config, key, None) allows setting arbitrary new attributes on the config. While this is necessary to support quantization_config overrides, it means typos in model_kwargs keys will silently create new attributes rather than warning the user.

Consider adding a warning when setting a key that doesn't exist on the config (while still allowing it):

💡 Suggested improvement
 for key, value_new in update_dict.items():
     target_value = getattr(config, key, None)
+    if target_value is None and not hasattr(config, key):
+        logger.warning(
+            f"model_kwargs key '{key}' does not exist on config, creating new attribute"
+        )

     # Handle nested PretrainedConfig objects when value is a dict
tensorrt_llm/llmapi/llm_utils.py (1)

400-470: Code duplication with ModelConfig.load_hf_quant_config.

The quantization config handling logic here duplicates ModelConfig.load_hf_quant_config in tensorrt_llm/_torch/model_config.py (lines 315-407 in the provided context). Both handle the same quant methods (fp8, mxfp4, compressed-tensors) with similar but slightly different logic.

Consider refactoring to call ModelConfig.load_hf_quant_config directly or extracting a shared utility function to avoid divergence and maintenance burden.
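The duplication noted above could be reduced by routing both call sites through one normalization helper. The sketch below is hypothetical: the algo names and dict keys are illustrative stand-ins, not TRT-LLM's actual API.

```python
from typing import Optional, Tuple

# Hypothetical shared helper: one place maps an HF quantization_config dict
# to a normalized (algo_name, group_size) pair, so llm_utils.py and
# model_config.py cannot drift apart. Names here are stand-ins.
def normalize_hf_quant_config(hf_quant_config: dict) -> Tuple[str, Optional[int]]:
    method = hf_quant_config.get("quant_method")
    if method == "fp8":
        block = hf_quant_config.get("weight_block_size")
        if block:
            return "FP8_BLOCK_SCALES", block[0]
        return "FP8", None
    if method == "mxfp4":
        return "MXFP4", 32  # 32 is the usual MX block size, assumed here
    raise ValueError(f"Unsupported quant_method: {method!r}")

print(normalize_hf_quant_config({"quant_method": "fp8",
                                 "weight_block_size": [128, 128]}))
# ('FP8_BLOCK_SCALES', 128)
```

Both code paths would then only differ in how they apply the normalized result, not in how they interpret the HF config.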

tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py (1)

139-147: Consider adding explicit LLM cleanup to release GPU resources.

The LLM instances are created but not explicitly cleaned up. While pytest will eventually garbage collect them, explicit cleanup could help with GPU memory management, especially when running multiple tests.

💡 Optional improvement
# At the end of each test, add:
llm.shutdown()  # or appropriate cleanup method if available

Alternatively, use a pytest fixture with yield for automatic cleanup.

Also applies to: 188-196, 250-266
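The cleanup suggested above uses the yield pattern; the same idea can be shown outside pytest with a context manager. A minimal sketch, assuming a hypothetical FakeLLM whose shutdown() method stands in for whatever cleanup API the real LLM class exposes:

```python
from contextlib import contextmanager

class FakeLLM:
    """Stand-in for the real LLM class; shutdown() here is hypothetical."""
    def __init__(self):
        self.closed = False
    def shutdown(self):
        self.closed = True

@contextmanager
def managed_llm():
    llm = FakeLLM()
    try:
        yield llm          # the test body runs here
    finally:
        llm.shutdown()     # cleanup runs even if the body raises

with managed_llm() as llm:
    assert not llm.closed  # resource is live inside the block
    last = llm
print(last.closed)  # True: shutdown ran on exit
```

In a pytest suite, the same body becomes a fixture: decorate a generator with @pytest.fixture, yield the instance, and call the cleanup method after the yield.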

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@Superjomn Superjomn requested a review from syuoni January 28, 2026 09:12
@tensorrt-cicd
Collaborator

PR_Github #33860 [ run ] triggered by Bot. Commit: d198ca4

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33872 [ run ] triggered by Bot. Commit: ca1c20f

@tensorrt-cicd
Collaborator

PR_Github #33872 [ run ] completed with state SUCCESS. Commit: ca1c20f
/LLM/main/L0_MergeRequest_PR pipeline #26120 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.
Pipeline has performance regression cases. Check the performance regression report for details.

Collaborator

@syuoni syuoni left a comment


Left some minor comments, thanks!

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33975 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #33975 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26208 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@shuyixiong shuyixiong enabled auto-merge (squash) January 29, 2026 09:51
@tensorrt-cicd
Collaborator

PR_Github #34042 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34042 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26263 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34053 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34053 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26272 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34195 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34195 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26385 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34306 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34306 [ run ] completed with state FAILURE. Commit: f0d9118
LLM/main/L0_MergeRequest_PR #26458 (Blue Ocean) completed with status: ABORTED

@shuyixiong
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34315 [ run ] triggered by Bot. Commit: f0d9118

@tensorrt-cicd
Collaborator

PR_Github #34315 [ run ] completed with state SUCCESS. Commit: f0d9118
/LLM/main/L0_MergeRequest_PR pipeline #26467 completed with status: 'SUCCESS'

@shuyixiong shuyixiong merged commit 278ced9 into NVIDIA:main Jan 31, 2026
5 checks passed