[None][fix] Nemotron Super fix #11425
Conversation
/bot run
📝 Walkthrough
Modified KV cache configuration logic in

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 121-127: The draft KV cache size calculation overcounts when MTP
falls back to the target model config: in
_should_create_separate_draft_kv_cache() you call _get_effective_draft_config()
and then kv_cache_manager_cls.get_cache_size_per_token(...) which uses the full
target layer count; change this to compute and pass the actual draft layer count
used by _create_one_model_draft_kv_cache_manager (call get_num_spec_layers or
derive num_draft_layers from layer_mask) so get_cache_size_per_token receives a
draft-specific config or an explicit num_draft_layers parameter (i.e., create a
short-lived draft_config with the correct layer count or add num_draft_layers
arg) to ensure the estimated kv_size_per_token matches the allocated draft KV
cache.
elif self._should_create_separate_draft_kv_cache():
    # One-model draft with separate KV cache layout
    effective_draft_config = self._get_effective_draft_config()
    kv_size_per_token += self._kv_cache_manager_cls.get_cache_size_per_token(
-       self._draft_config,
+       effective_draft_config,
        mapping,
        tokens_per_block=self._tokens_per_block)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find the relevant Python files
echo "=== Finding tensorrt_llm module files ==="
# Note: fd takes a regex; inside double quotes "\|" reaches fd as an escaped
# (literal) pipe, so use a bare "|" for alternation.
fd -t f -e py "kv.*manager|executor.*util" | head -20
Repository: NVIDIA/TensorRT-LLM
Length of output: 104
🏁 Script executed:
#!/bin/bash
# Look at the specific file mentioned in the review
echo "=== Checking _util.py file ==="
cat -n tensorrt_llm/_torch/pyexecutor/_util.py | head -150
Repository: NVIDIA/TensorRT-LLM
Length of output: 8269
🏁 Script executed:
#!/bin/bash
# Search for _get_effective_draft_config method
echo "=== Search for _get_effective_draft_config ==="
rg -n "_get_effective_draft_config" -C 5 --type=py
Repository: NVIDIA/TensorRT-LLM
Length of output: 3579
🏁 Script executed:
#!/bin/bash
# Search for _create_one_model_draft_kv_cache_manager
echo "=== Search for _create_one_model_draft_kv_cache_manager ==="
rg -n "_create_one_model_draft_kv_cache_manager" -C 10 --type=py
Repository: NVIDIA/TensorRT-LLM
Length of output: 4131
🏁 Script executed:
#!/bin/bash
# Search for get_cache_size_per_token implementations
echo "=== Search for get_cache_size_per_token implementations ==="
rg -n "def get_cache_size_per_token" -C 15 --type=py
Repository: NVIDIA/TensorRT-LLM
Length of output: 12828
🏁 Script executed:
#!/bin/bash
# Read the full implementation of KVCacheManager.get_cache_size_per_token
echo "=== KVCacheManager.get_cache_size_per_token full implementation ==="
sed -n '802,900p' tensorrt_llm/_torch/pyexecutor/resource_manager.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 4914
🏁 Script executed:
#!/bin/bash
# Read the full implementation of MambaHybridCacheManager.get_cache_size_per_token
echo "=== MambaHybridCacheManager.get_cache_size_per_token ==="
rg -n "class MambaHybridCacheManager" --type=py -A 200 | head -250
Repository: NVIDIA/TensorRT-LLM
Length of output: 10094
🏁 Script executed:
#!/bin/bash
# Read the full implementation of _create_one_model_draft_kv_cache_manager
echo "=== _create_one_model_draft_kv_cache_manager full implementation ==="
sed -n '557,620p' tensorrt_llm/_torch/pyexecutor/_util.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 3353
🏁 Script executed:
#!/bin/bash
# Check what get_num_attention_layers returns from the model config
echo "=== Searching for get_num_attention_layers method ==="
rg -n "def get_num_attention_layers" --type=py -A 3
# Also check ModelConfig to understand its structure
echo "=== Searching for ModelConfig class definition ==="
rg -n "class ModelConfig" --type=py -A 10 | head -50
Repository: NVIDIA/TensorRT-LLM
Length of output: 4173
🏁 Script executed:
#!/bin/bash
# Search for _create_kv_cache_manager helper function to see if it uses num_layers
echo "=== Searching for _create_kv_cache_manager function ==="
rg -n "def _create_kv_cache_manager" --type=py -C 5
Repository: NVIDIA/TensorRT-LLM
Length of output: 3294
🏁 Script executed:
#!/bin/bash
# Look at the _create_kv_cache_manager function fully to see how it handles num_layers
echo "=== Full _create_kv_cache_manager function (module-level) ==="
sed -n '646,800p' tensorrt_llm/_torch/pyexecutor/_util.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 6966
🏁 Script executed:
#!/bin/bash
# Check what the complete get_num_attention_layers method returns
echo "=== Complete get_num_attention_layers ==="
sed -n '770,780p' tensorrt_llm/_torch/model_config.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 803
🏁 Script executed:
#!/bin/bash
# Check if get_cache_size_per_token in MambaHybridCacheManager is overridden
echo "=== Search for get_cache_size_per_token in MambaHybridCacheManager ==="
rg -n "class MambaHybridCacheManager" --type=py -A 300 | grep -A 30 "def get_cache_size_per_token"
Repository: NVIDIA/TensorRT-LLM
Length of output: 133
🏁 Script executed:
#!/bin/bash
# Look at how get_num_spec_layers works
echo "=== Search for get_num_spec_layers ==="
rg -n "def get_num_spec_layers" --type=py -A 10
Repository: NVIDIA/TensorRT-LLM
Length of output: 1004
🏁 Script executed:
#!/bin/bash
# Check if get_cache_size_per_token for MambaHybridCacheManager uses layer_mask
echo "=== Check KVCacheManager.get_cache_size_per_token signature ==="
sed -n '802,850p' tensorrt_llm/_torch/pyexecutor/resource_manager.py | head -60
Repository: NVIDIA/TensorRT-LLM
Length of output: 2343
🏁 Script executed:
#!/bin/bash
# Now let's trace through the exact values for nemotron_hybrid
# First check: does get_cache_size_per_token accept layer_mask parameter?
echo "=== Checking if get_cache_size_per_token accepts layer_mask ==="
rg -n "def get_cache_size_per_token.*layer_mask" --type=py
# And check KVCacheManagerV2.get_cache_size_per_token
echo "=== Checking KVCacheManagerV2.get_cache_size_per_token ==="
rg -n "class KVCacheManagerV2" --type=py -A 50 | grep -A 20 "def get_cache_size_per_token"
Repository: NVIDIA/TensorRT-LLM
Length of output: 185
🏁 Script executed:
#!/bin/bash
# Let me understand the actual issue better by checking the calculation in detail
# For nemotron_hybrid, when layer_mask is None (in size estimation), what does it count?
echo "=== Look at the full get_cache_size_per_token code path for nemotron_hybrid ==="
sed -n '802,855p' tensorrt_llm/_torch/pyexecutor/resource_manager.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 2625
🏁 Script executed:
#!/bin/bash
# Now let's verify the actual numbers by checking the code flow one more time
# The key question: Does get_cache_size_per_token for nemotron_hybrid use get_num_attention_layers()?
echo "=== Verify the calculation path ==="
cat tensorrt_llm/_torch/model_config.py | sed -n '770,782p'
# For nemotron_hybrid, get_num_attention_layers returns pattern.count("*")
# This is used in get_cache_size_per_token at line 832:
echo ""
echo "=== Line 832 in resource_manager.py uses get_num_attention_layers() ==="
sed -n '830,835p' tensorrt_llm/_torch/pyexecutor/resource_manager.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 1202
🏁 Script executed:
#!/bin/bash
# Now verify: In _create_one_model_draft_kv_cache_manager, what layer count is actually passed?
echo "=== What num_draft_layers is passed to _create_kv_cache_manager? ==="
sed -n '569,608p' tensorrt_llm/_torch/pyexecutor/_util.py | grep -A 40 "num_draft_layers"
Repository: NVIDIA/TensorRT-LLM
Length of output: 1979
🏁 Script executed:
#!/bin/bash
# Trace the exact calculation: estimation vs actual for nemotron_hybrid
echo "=== Verify MTP num_nextn_predict_layers vs target attention layers ==="
rg -n "num_nextn_predict_layers" --type=py -C 3 | head -40
Repository: NVIDIA/TensorRT-LLM
Length of output: 3499
🏁 Script executed:
#!/bin/bash
# Check if the layer_mask is actually used in the size estimation
echo "=== Does get_cache_size_per_token accept layer_mask parameter? ==="
rg -n "def get_cache_size_per_token" --type=py -B 2 -A 1 | head -30
Repository: NVIDIA/TensorRT-LLM
Length of output: 1820
🏁 Script executed:
#!/bin/bash
# Verify the complete picture: for nemotron_hybrid, what happens in _get_kv_size_per_token
echo "=== Full verification: nemotron_hybrid MTP case ==="
echo "1. _get_effective_draft_config for MTP returns:"
sed -n '540,555p' tensorrt_llm/_torch/pyexecutor/_util.py | tail -5
echo ""
echo "2. For nemotron_hybrid, get_num_attention_layers() returns pattern.count('*')"
echo " This is used in get_cache_size_per_token at line 832"
echo ""
echo "3. But actual allocation passes num_draft_layers (from get_num_spec_layers):"
sed -n '569,572p' tensorrt_llm/_torch/pyexecutor/_util.py
echo ""
echo "4. The layer_mask is created with only num_draft_layers True values:"
sed -n '575,577p' tensorrt_llm/_torch/pyexecutor/_util.py
echo ""
echo "Conclusion: Estimation uses full target attention layers, allocation uses only draft layers"
Repository: NVIDIA/TensorRT-LLM
Length of output: 1159
🏁 Script executed:
#!/bin/bash
# Final check: is the size estimation actually used for memory calculation?
echo "=== Is kv_size_per_token used in memory calculation? ==="
sed -n '130,147p' tensorrt_llm/_torch/pyexecutor/_util.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 1222
Draft KV cache size estimation overcounts for MTP on hybrid models.
When effective_draft_config falls back to the target model config for MTP, get_cache_size_per_token is called with the full target config, including all of its attention layers (hybrid_override_pattern.count("*") for nemotron_hybrid models). However, the actual draft KV cache, created in _create_one_model_draft_kv_cache_manager, only allocates for num_draft_layers (from get_num_spec_layers), specified via layer_mask. Because of this mismatch, the memory budget reserved for the draft is significantly larger than what is actually needed, which reduces the capacity available for the target KV cache.
Pass num_draft_layers to the size calculation or use a draft-specific config with the correct layer count to avoid over-reserving memory.
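To make the mismatch concrete, here is a minimal, self-contained sketch of the over-counting and the proposed fix. ToyConfig and its size formula are invented for illustration; only the ideas (a spec-layer count from get_num_spec_layers, a short-lived draft config) mirror the review discussion, and the real TensorRT-LLM signatures differ.

```python
# Hypothetical sketch: ToyConfig and this size formula are illustrative
# stand-ins, not the real TensorRT-LLM ModelConfig / KVCacheManager API.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConfig:
    num_attention_layers: int  # for nemotron_hybrid: hybrid_override_pattern.count("*")
    num_kv_heads: int = 8
    head_dim: int = 128
    dtype_bytes: int = 2  # fp16 / bf16


def get_cache_size_per_token(cfg: ToyConfig) -> int:
    # K and V tensors for every attention layer, per token.
    return 2 * cfg.num_attention_layers * cfg.num_kv_heads * cfg.head_dim * cfg.dtype_bytes


# Target hybrid model with 12 attention layers; the MTP draft actually
# allocates for only 1 speculative layer (what get_num_spec_layers returns).
target_config = ToyConfig(num_attention_layers=12)
num_draft_layers = 1

# Buggy estimate: MTP falls back to the full target config.
overcounted = get_cache_size_per_token(target_config)

# Proposed fix: short-lived draft config with the real draft layer count,
# so the estimate matches what layer_mask actually allocates.
draft_config = replace(target_config, num_attention_layers=num_draft_layers)
corrected = get_cache_size_per_token(draft_config)

print(overcounted, corrected)  # the buggy estimate is 12x too large here
```

Under these toy numbers the estimation reserves 49152 bytes per token for the draft while the allocation only needs 4096, which is exactly the kind of over-reservation the review describes.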
PR_Github #35521 [ run ] triggered by Bot. Commit:
PR_Github #35521 [ run ] completed with state
/bot run
b78dff9 to b954710 Compare
PR_Github #35578 [ run ] triggered by Bot. Commit:
b954710 to 9713821 Compare
/bot run
PR_Github #35587 [ run ] triggered by Bot. Commit:
PR_Github #35587 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #35644 [ run ] triggered by Bot. Commit:
PR_Github #35644 [ run ] completed with state
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
1be697e to 4ae5969 Compare
/bot run --disable-fail-fast
PR_Github #35686 [ run ] triggered by Bot. Commit:
brb-nv left a comment:
Minor refactor comment. LGTM.
PR_Github #35686 [ run ] completed with state
/bot run
/bot run
PR_Github #35945 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #35956 [ run ] triggered by Bot. Commit:
PR_Github #35956 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #36014 [ run ] triggered by Bot. Commit:
PR_Github #36014 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #36086 [ run ] triggered by Bot. Commit:
PR_Github #36086 [ run ] completed with state
127d746 to bc98799 Compare
/bot run --disable-fail-fast
PR_Github #36261 [ run ] triggered by Bot. Commit:
PR_Github #36261 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #36294 [ run ] triggered by Bot. Commit:
PR_Github #36294 [ run ] completed with state
bc98799 to d3efb9a Compare
/bot run --disable-fail-fast
PR_Github #36303 [ run ] triggered by Bot. Commit:
PR_Github #36303 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #36315 [ run ] triggered by Bot. Commit:
PR_Github #36315 [ run ] completed with state
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
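As an illustration of how the flags above combine, a typical invocation posted as a PR comment might look like the following. The stage and GPU names are the examples from the help text, not required values:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "A30"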