[TRTLLM-10857][chore] Move SaveHiddenStates spec dec mode to 1 model#11241
mikeiovine merged 4 commits into NVIDIA:main from
Conversation
Force-pushed 97f094f to feafa60
📝 Walkthrough

The pull request refactors the speculative decoding hidden-states saving mechanism from a Drafter-based to a ResourceManager-based architecture. The SaveHiddenStatesDrafter is replaced with SaveHiddenStatesResourceManager and SaveHiddenStatesSpecMetadata. The post-forward flow now calls the resource manager's process_and_save method instead of the drafter's post-hook, and the drafter's run_drafter_post method is removed.
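The refactor described in the walkthrough can be sketched as follows. This is a minimal illustration, not the repository code: the class and method names (SaveHiddenStatesResourceManager, process_and_save, the SPEC_RESOURCE_MANAGER key) mirror the walkthrough, but the bodies and the post_forward helper are assumptions for illustration.

```python
from typing import Any, Dict, Optional


class SaveHiddenStatesResourceManager:
    """Illustrative stand-in for the new resource-manager-based saver."""

    def __init__(self) -> None:
        self.saved = []

    def process_and_save(self, spec_metadata: Any, hidden_states: Any) -> None:
        # The real implementation would write the captured hidden states for
        # the configured layers to the configured output directory; here we
        # just record the call.
        self.saved.append((spec_metadata, hidden_states))


def post_forward(resource_managers: Dict[str, Any],
                 spec_metadata: Optional[Any],
                 hidden_states: Any) -> None:
    # Replaces the old drafter post-hook (run_drafter_post): the executor loop
    # looks up the spec resource manager and delegates saving to it, so no
    # second Drafter model is needed.
    mgr = resource_managers.get("SPEC_RESOURCE_MANAGER")
    if isinstance(mgr, SaveHiddenStatesResourceManager) and spec_metadata is not None:
        mgr.process_and_save(spec_metadata, hidden_states)
```

The key design point is that saving becomes a property of the execution resources rather than of a drafter object, which is why the drafter hook could be deleted.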
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/llmapi/llm_args.py (2)
1-5: ⚠️ Potential issue | 🟡 Minor — Add required NVIDIA copyright header.
This TensorRT‑LLM source file starts directly with imports; please add the standard NVIDIA header with the latest modification year at the top of the file.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
1051-1067: ⚠️ Potential issue | 🟠 Major — Align SaveHiddenStates validation with the new default-capture behavior.
`model_post_init` now treats `eagle3_layers_to_capture=None` as valid, but `validate()` still rejects falsy values. If `validate()` is invoked, the default-capture path will fail. Either allow `None` explicitly or remove the default path/documentation.

🔧 Proposed fix

```diff
 def validate(self) -> None:
-    if self.output_directory is None or not self.eagle3_layers_to_capture:
-        raise ValueError(
-            "Save directory and layers to capture must be provided")
+    if self.output_directory is None:
+        raise ValueError("Save directory must be provided")
+    if self.eagle3_layers_to_capture is not None and len(self.eagle3_layers_to_capture) == 0:
+        raise ValueError("layers_to_capture must be non-empty when provided")
```

tensorrt_llm/_torch/speculative/__init__.py (1)
1-3: ⚠️ Potential issue | 🟡 Minor — Add required NVIDIA copyright header.
Please add the standard NVIDIA header with the latest modification year at the top of this source file.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

tensorrt_llm/_torch/speculative/utils.py (1)
1-3: ⚠️ Potential issue | 🟡 Minor — Add required NVIDIA copyright header.
Please add the standard NVIDIA header with the latest modification year at the top of this source file.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around lines 1598-1607: The SaveHiddenStates feature is not validated against pipeline parallelism, causing a silent failure when pp_size > 1, because only _executor_loop calls spec_resource_mgr.process_and_save. Either enforce pp_size == 1 in the SaveHiddenStatesDecodingConfig initializer (add an assertion/validation referencing SaveHiddenStatesDecodingConfig and llm_args.pp_size), or add the same hook into _executor_loop_pp (check for ResourceManagerType.SPEC_RESOURCE_MANAGER, getattr(self.model_engine, 'spec_metadata', None), and call spec_resource_mgr.process_and_save for the final pipeline rank), so that SaveHiddenStates always runs or fails loudly when PP is enabled.
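The first option suggested above (failing loudly at config time) could look like the following sketch. The class and field names (SaveHiddenStatesDecodingConfig, pp_size) come from the comment, but the dataclass structure and the validation message are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class SaveHiddenStatesDecodingConfig:
    output_directory: str
    pp_size: int = 1

    def __post_init__(self) -> None:
        # Only the non-pipeline-parallel executor loop invokes
        # process_and_save, so reject pp_size > 1 up front instead of
        # silently producing no saved hidden states.
        if self.pp_size != 1:
            raise ValueError(
                f"SaveHiddenStates does not support pipeline parallelism "
                f"(got pp_size={self.pp_size}); use pp_size=1.")
```

Validating at construction time keeps the failure close to the misconfiguration rather than deep inside the executor loop.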
In `@tensorrt_llm/_torch/speculative/save_hidden_state.py`:
- Around lines 67-68: The function get_needed_resource_to_completion currently declares an unused parameter request, which triggers the ARG002 lint rule. Either rename request to _request in the signature (e.g., def get_needed_resource_to_completion(self, _request: LlmRequest):) or explicitly discard the variable by adding del request at the top of the body, so the linter stops flagging the unused parameter while the existing return behavior is preserved.
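A minimal sketch of the rename option, under stated assumptions: the zero return value and the absence of per-request resources are illustrative guesses, not the repository's actual behavior.

```python
class SaveHiddenStatesResourceManager:
    def get_needed_resource_to_completion(self, _request) -> int:
        # The leading underscore marks the parameter as intentionally unused,
        # which satisfies Ruff's ARG002 check without changing the call
        # signature expected by the resource-manager interface.
        # Returning 0 here is an assumption for illustration.
        return 0
```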
- Around line 135-171: In SaveHiddenStatesSpecMetadata.__post_init__, the
default capture computation calls
_get_eagle3_default_capture_layers(self.num_layers) which uses the wrong field;
change it to call _get_eagle3_default_capture_layers(self.num_model_layers) so
defaults are based on the actual model layer count, keep the rest of the logic
(sorting layers_to_capture, handling -1 last-layer marker, and setting
num_capture_layers) unchanged and ensure the reference to
SaveHiddenStatesSpecMetadata and method __post_init__ are updated accordingly.
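The corrected default-capture computation can be sketched as below. The field names (num_model_layers, layers_to_capture) follow the comment; the dataclass shape is an assumption for illustration, and the -1 last-layer marker handling mentioned above is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def _get_eagle3_default_capture_layers(num_layers: int) -> Tuple[int, int, int]:
    # Default Eagle3 capture indices, as shown in the suggested-docstring
    # diff elsewhere in this review.
    return (1, num_layers // 2 - 1, num_layers - 4)


@dataclass
class SaveHiddenStatesSpecMetadata:
    num_model_layers: int
    layers_to_capture: Optional[Tuple[int, ...]] = None

    def __post_init__(self) -> None:
        if self.layers_to_capture is None:
            # The fix: derive defaults from the actual model layer count
            # (num_model_layers), not from the wrong num_layers field.
            self.layers_to_capture = _get_eagle3_default_capture_layers(
                self.num_model_layers)
        self.layers_to_capture = tuple(sorted(self.layers_to_capture))
        self.num_capture_layers = len(self.layers_to_capture)
```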
- Around line 1-4: Add the standard NVIDIA copyright header (with the latest
modification year) at the very top of
tensorrt_llm/_torch/speculative/save_hidden_state.py before any imports; ensure
the header matches the project's canonical NVIDIA header text and formatting
used across other TensorRT-LLM source files and includes the appropriate year of
last meaningful modification.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/speculative/eagle3.py (1)
111-112: Add a Google-style docstring for the new helper. Keeps the new utility compliant and clarifies the tuple semantics.
✍️ Suggested update
```diff
 def _get_eagle3_default_capture_layers(num_layers: int):
+    """Return default Eagle3 layer indices to capture.
+
+    Args:
+        num_layers: Total number of layers in the model.
+
+    Returns:
+        Tuple of layer indices to capture.
+    """
     return (1, num_layers // 2 - 1, num_layers - 4)
```

As per coding guidelines: Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
tensorrt_llm/_torch/speculative/interface.py (1)
135-136: Consider documenting the updated `has_spec_drafter` behavior.

✍️ Suggested update
```diff
 def has_spec_drafter(self):
+    """Return True if this mode uses a spec drafter."""
     return self.is_eagle3() or self.is_draft_target() or self.is_ngram(
     ) or self.is_user_provided() or self.is_mtp_eagle()
```

As per coding guidelines: Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
tensorrt_llm/_torch/speculative/__init__.py (1)
6-22: Keep save_hidden_state imports namespaced. The guidelines require module-namespace imports. Consider importing the module and re-exporting the symbols.
♻️ Suggested refactor
```diff
-from .save_hidden_state import (SaveHiddenStatesResourceManager,
-                                SaveHiddenStatesSpecMetadata)
+from . import save_hidden_state
+
+SaveHiddenStatesResourceManager = save_hidden_state.SaveHiddenStatesResourceManager
+SaveHiddenStatesSpecMetadata = save_hidden_state.SaveHiddenStatesSpecMetadata
```

As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.
tensorrt_llm/_torch/speculative/utils.py (1)
17-18: Keep save_hidden_state imports namespaced. To comply with the import-namespace guideline, import the module and qualify usages.
♻️ Suggested refactor
```diff
-from .save_hidden_state import (SaveHiddenStatesResourceManager,
-                                SaveHiddenStatesSpecMetadata)
+from . import save_hidden_state
@@
-    return SaveHiddenStatesSpecMetadata(
+    return save_hidden_state.SaveHiddenStatesSpecMetadata(
@@
-    return SaveHiddenStatesResourceManager(
+    return save_hidden_state.SaveHiddenStatesResourceManager(
```

As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.
Also applies to: 84-95, 145-151
tensorrt_llm/_torch/speculative/save_hidden_state.py (1)
2-12: Keep internal imports namespaced. To comply with the namespace-import guideline, import the modules and qualify the base classes.
♻️ Suggested refactor
```diff
-from ..pyexecutor.resource_manager import BaseResourceManager
-from .interface import SpecMetadata
+from ..pyexecutor import resource_manager
+from . import interface
@@
-class SaveHiddenStatesResourceManager(BaseResourceManager):
+class SaveHiddenStatesResourceManager(resource_manager.BaseResourceManager):
@@
-class SaveHiddenStatesSpecMetadata(SpecMetadata):
+class SaveHiddenStatesSpecMetadata(interface.SpecMetadata):
```

As per coding guidelines: Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.
Also applies to: 18-18, 136-136
Force-pushed feafa60 to b70c4ec
/bot run --disable-fail-fast

PR_Github #34689 [ run ] triggered by Bot. Commit:
PR_Github #34689 [ run ] completed with state

Force-pushed b70c4ec to daffa6e

/bot run --disable-fail-fast

PR_Github #34813 [ run ] triggered by Bot. Commit:
PR_Github #34813 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34978 [ run ] triggered by Bot. Commit:
PR_Github #34978 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #35120 [ run ] triggered by Bot. Commit:
PR_Github #35120 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #35373 [ run ] triggered by Bot. Commit:
PR_Github #35373 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #35518 [ run ] triggered by Bot. Commit:
PR_Github #35518 [ run ] completed with state
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Force-pushed 876761b to c92f8a6
/bot run --disable-fail-fast

PR_Github #35645 [ run ] triggered by Bot. Commit:
PR_Github #35645 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #35808 [ run ] triggered by Bot. Commit:
PR_Github #35808 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #35921 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #35937 [ run ] triggered by Bot. Commit:
PR_Github #35937 [ run ] completed with state

/bot run

PR_Github #36087 [ run ] triggered by Bot. Commit:
PR_Github #36087 [ run ] completed with state
Description
Remove another dependency on the deprecated 2-model `Drafter` machinery. Also attempt to document the existing behavior in the code.

Test Coverage
Existing tests pass.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help

`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

`kill`

Kill all running builds associated with the pull request.

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
API Changes
Improvements