[TRTLLM-9771][feat] Allow overriding quantization configs #11062

shuyixiong merged 3 commits into NVIDIA:main from
Conversation
Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>
📝 Walkthrough

The changes introduce runtime model configuration through a new

Changes

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@tensorrt_llm/llmapi/llm_utils.py`:
- Around line 410-418: The MXFP4 branch sets quant_config.exclude_modules
directly but doesn't merge HF-provided modules_to_not_convert like
ModelConfig.load_hf_quant_config does; update the block in llm_utils.py where
ModelConfig.get_mxfp4_quant_algo(...) is used so that after setting the default
exclude_modules you read hf_quant_config.get("modules_to_not_convert") (or
similar key) and extend/merge those entries into quant_config.exclude_modules
(ensuring no duplicates), and also set quant_config.modules_to_not_convert if
that attribute is used elsewhere; keep ModelConfig.get_mxfp4_quant_algo(...) and
group_size assignment unchanged while only adding the merge step.
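A minimal sketch of the merge step described in this comment (the helper name is illustrative, not the actual TRT-LLM API):

```python
# Illustrative sketch of the suggested MXFP4 fix: merge HF-provided
# modules_to_not_convert into the default exclude_modules without duplicates.
def merge_exclude_modules(default_exclude: list, hf_quant_config: dict) -> list:
    hf_modules = hf_quant_config.get("modules_to_not_convert") or []
    merged = list(default_exclude)
    for module in hf_modules:
        if module not in merged:
            merged.append(module)
    return merged

cfg = {"quant_method": "mxfp4", "modules_to_not_convert": ["lm_head", "router"]}
print(merge_exclude_modules(["lm_head"], cfg))  # ['lm_head', 'router']
```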
- Around line 394-398: The f-string in the logger.info call inside the block
that checks self.llm_args.model_kwargs and "quantization_config" uses nested
double quotes which can break on Python <3.12; to fix, extract the value into a
temporary variable (e.g., quant_cfg =
self.llm_args.model_kwargs["quantization_config"]) or use alternate quoting
(single quotes for the dict key) and then log with logger.info(f"Update
hf_quant_config from model_kwargs: quantization_config={quant_cfg} (previous
value: {hf_quant_config})"); update the code around the hf_quant_config
assignment so it reads the temporary variable and then sets hf_quant_config =
quant_cfg.
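For illustration, the quoting pitfall and both workarounds described above (the dict contents are placeholders, not the real configs):

```python
# On Python < 3.12, f"...{d["key"]}..." is a SyntaxError because the inner
# double quotes terminate the string; Python 3.12+ (PEP 701) allows it.
hf_quant_config = {"quant_method": "fp8"}
model_kwargs = {"quantization_config": {"quant_method": "mxfp4"}}

# Workaround 1: pull the value into a temporary variable first.
quant_cfg = model_kwargs["quantization_config"]
msg = (f"Update hf_quant_config from model_kwargs: quantization_config="
       f"{quant_cfg} (previous value: {hf_quant_config})")
hf_quant_config = quant_cfg

# Workaround 2: use single quotes for the key inside the f-string.
msg2 = f"quantization_config={model_kwargs['quantization_config']}"
print(msg)
```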
- Around line 404-418: The FP8 branch in llm_utils.py sets quant_algo and
exclude_modules but misses group_size, block_size validation, and merging
modules_to_not_convert as done in ModelConfig.load_hf_quant_config; update the
FP8 handling (where QuantAlgo.FP8_BLOCK_SCALES is set) to: set
quant_config.group_size = hf_quant_config["block_size"][0], validate that
tuple(hf_quant_config.get("block_size",())) == (128, 128) and raise/handle
otherwise, set quant_config.block_size accordingly, and merge
quant_config.modules_to_not_convert with the default modules_to_not_convert from
ModelConfig.load_hf_quant_config (or ModelConfig defaults) so TRT-LLM and the
PyTorch path behave consistently.
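A sketch of the FP8 branch suggested in this comment (the helper name and the plain-attribute `quant_config` are illustrative; the real TRT-LLM `QuantConfig` fields may differ):

```python
from types import SimpleNamespace

def apply_fp8_block_scales(quant_config, hf_quant_config: dict) -> None:
    """Validate block_size and set group_size, mirroring the suggested fix."""
    block_size = tuple(hf_quant_config.get("block_size", ()))
    if block_size != (128, 128):
        raise ValueError(
            f"Unsupported FP8 block_size {block_size}; expected (128, 128)")
    quant_config.group_size = block_size[0]
    quant_config.block_size = block_size

quant_config = SimpleNamespace()
apply_fp8_block_scales(quant_config, {"quant_method": "fp8",
                                      "block_size": [128, 128]})
print(quant_config.group_size)  # 128
```

Rejecting any `block_size` other than `(128, 128)` up front keeps the TRT-LLM and PyTorch paths from silently diverging on unsupported layouts.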
🧹 Nitpick comments (3)
tensorrt_llm/_torch/model_config.py (1)
530-557: Permissive attribute setting removes validation for typos in model_kwargs keys.

The change from checking key existence to using `getattr(config, key, None)` allows setting arbitrary new attributes on the config. While this is necessary to support `quantization_config` overrides, it means typos in `model_kwargs` keys will silently create new attributes rather than warning the user. Consider adding a warning when setting a key that doesn't exist on the config (while still allowing it):

💡 Suggested improvement

```diff
 for key, value_new in update_dict.items():
     target_value = getattr(config, key, None)
+    if target_value is None and not hasattr(config, key):
+        logger.warning(
+            f"model_kwargs key '{key}' does not exist on config, creating new attribute"
+        )
     # Handle nested PretrainedConfig objects when value is a dict
```

tensorrt_llm/llmapi/llm_utils.py (1)
400-470: Code duplication with `ModelConfig.load_hf_quant_config`.

The quantization config handling logic here duplicates `ModelConfig.load_hf_quant_config` in `tensorrt_llm/_torch/model_config.py` (lines 315-407 in the provided context). Both handle the same quant methods (fp8, mxfp4, compressed-tensors) with similar but slightly different logic. Consider refactoring to call `ModelConfig.load_hf_quant_config` directly or extracting a shared utility function to avoid divergence and maintenance burden.

tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py (1)
139-147: Consider adding explicit LLM cleanup to release GPU resources.

The `LLM` instances are created but not explicitly cleaned up. While pytest will eventually garbage collect them, explicit cleanup could help with GPU memory management, especially when running multiple tests.

💡 Optional improvement

```python
# At the end of each test, add:
llm.shutdown()  # or appropriate cleanup method if available
```

Alternatively, use a pytest fixture with `yield` for automatic cleanup.

Also applies to: 188-196, 250-266
/bot run --disable-fail-fast

PR_Github #33860 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #33872 [ run ] triggered by Bot. Commit:

PR_Github #33872 [ run ] completed with state

syuoni left a comment:
Left some minor comments, thanks!

Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>

/bot run --disable-fail-fast

PR_Github #33975 [ run ] triggered by Bot. Commit:

PR_Github #33975 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34042 [ run ] triggered by Bot. Commit:

PR_Github #34042 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34053 [ run ] triggered by Bot. Commit:

PR_Github #34053 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34195 [ run ] triggered by Bot. Commit:

PR_Github #34195 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34306 [ run ] triggered by Bot. Commit:

PR_Github #34306 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34315 [ run ] triggered by Bot. Commit:

PR_Github #34315 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
`model_kwargs` key filters, enabling quantization configuration overrides via `model_kwargs`.

Test Improvements

`num_hidden_layers` via `model_kwargs`.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

`--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

`--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.

`--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

`--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

`--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

`--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

`--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

`--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

`--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill` : Kill all running builds associated with pull request.

skip

`skip --comment COMMENT` : Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.