
[None][feat] Add performance alignment to layer-wise benchmarks #11018

Merged
yuantailing merged 21 commits into NVIDIA:main from yuantailing:layer_wise_benchmarks on Jan 29, 2026

Conversation

@yuantailing (Member) commented Jan 27, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added layer-wise performance alignment workflow with COLLECT and MARK calibration modes for profiling and benchmarking optimization.
    • Introduced calibration system with NONE, MARK, COLLECT, and REPLAY modes for managing kernel execution tracking and data collection.
    • Added interactive HTML visualization dashboard for correlating and comparing kernel execution timelines across different profiling runs.
  • Documentation

    • Added performance alignment section with end-to-end workflow documentation and configuration examples.
  • Tests

    • Added new test for performance alignment workflow validation.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@coderabbitai bot (Contributor) commented Jan 27, 2026

📝 Walkthrough

This patch introduces a comprehensive layer-wise calibration and benchmarking system. It adds a Calibrator class with COLLECT/MARK/REPLAY modes, integrates it into MOE routing and PyExecutor, provides benchmarking utilities for trace parsing and kernel correlation, and delivers workflow scripts for end-to-end performance alignment analysis.

Changes

  • Calibrator Core
    Files: tensorrt_llm/tools/layer_wise_benchmarks/calibrator.py, tensorrt_llm/tools/layer_wise_benchmarks/__init__.py, tensorrt_llm/llmapi/llm_args.py
    Introduces the Calibrator class with four modes (NONE, MARK, COLLECT, REPLAY) managing the lifecycle (start/pre_step/post_step/stop), layer-wise NVTX markers, slot data collection/replay with GPU buffers, metadata verification, and per-rank synchronization. Adds LayerwiseBenchmarksConfig to the LLM args. Exports the get_calibrator() factory function.
  • MOE and Executor Integration
    Files: tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, tensorrt_llm/tools/layer_wise_benchmarks/runner.py
    Integrates the calibrator into MOE routing (calls maybe_collect_or_replay_slots on token_selected_slots), the PyExecutor profiling lifecycle (start/pre_step/post_step/stop callbacks), and Runner model setup (adds a public model attribute, replaces per-layer iteration with a single model call).
  • Benchmarking Utilities
    Files: examples/layer_wise_benchmarks/parser_utils.py, examples/layer_wise_benchmarks/correlation.py, examples/layer_wise_benchmarks/parse_e2e.py
    Introduces kernel utilities (lazy SQLite conversion, kernel short-name resolution, a shortest-common-supersequence algorithm with optional Numba JIT). Adds correlation.py for mapping target timelines to a reference via SCS and linear interpolation. Adds parse_e2e.py for extracting and aligning kernels across eager/graph NSYS traces.
  • Parse and Template Updates
    Files: examples/layer_wise_benchmarks/parse.py, examples/layer_wise_benchmarks/correlation_template.html
    Refactors parse.py to use parser_utils functions, switches to breakdown_template.html, generates correlation JSON timeline output, adds Memcpy/Memset mappings, and adjusts NVTX filtering and warmup handling. Introduces correlation_template.html with dual interactive ECharts (Duration, End Time) supporting dataZoom, tooltips, and responsive layout.
  • Workflow Scripts and Configuration
    Files: examples/layer_wise_benchmarks/run.py, examples/layer_wise_benchmarks/sample_performance_alignment.sh, examples/layer_wise_benchmarks/middleware/mpi_env_from_ompi, examples/layer_wise_benchmarks/slurm_alloc.sh, examples/layer_wise_benchmarks/slurm_init_containers.sh
    Extends run.py with --replay-file-path and related CLI args, calibrator orchestration, and prefill context environment patches. Introduces sample_performance_alignment.sh with a 5-step workflow (dataset prep, COLLECT phase, MARK phase, gen run, post-processing and correlation). Adds the mpi_env_from_ompi bridge script and updates the slurm scripts (job naming, SLURM-aware arch detection).
  • Documentation and Tests
    Files: examples/layer_wise_benchmarks/README.md, tests/unittest/tools/test_layer_wise_benchmarks.py, tests/integration/test_lists/test-db/l0_b200.yml
    Adds a "Performance alignment" section to the README (duplicated, +113 lines). Introduces test_performance_alignment parameterized over world_size [1, 4] running sample_performance_alignment.sh. Updates the test database to include the new test entry.
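The shortest common supersequence that drives the kernel correlation can be sketched with a plain dynamic-programming implementation. This is an illustrative stand-in under simplified assumptions, not the Numba-accelerated version shipped in parser_utils.py:

```python
def shortest_common_supersequence(a, b):
    """Return a length-minimal sequence containing both a and b as subsequences."""
    n, m = len(a), len(b)
    # dp[i][j] = SCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dp[i][j] = m - j          # only b remains
            elif j == m:
                dp[i][j] = n - i          # only a remains
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to reconstruct one optimal supersequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

In the correlation workflow the two inputs would be the kernel-name sequences of two runs; positions of each run inside the supersequence then give the alignment used for interpolation.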

Sequence Diagram(s)

sequenceDiagram
    actor App as Application
    participant Cal as Calibrator
    participant MOE as MOE Module
    participant GPU as GPU Buffer
    participant File as File Storage

    App->>Cal: init(mode=COLLECT, ...)
    activate Cal
    Cal->>GPU: allocate fixed buffers
    deactivate Cal

    App->>Cal: start()
    activate Cal
    Cal->>Cal: reset state
    deactivate Cal

    loop Per Iteration
        App->>Cal: pre_step(it)
        App->>MOE: forward()
        activate MOE
        MOE->>Cal: maybe_collect_or_replay_slots()
        activate Cal
        Cal->>GPU: record slot data
        deactivate Cal
        MOE-->>App: token_selected_slots
        deactivate MOE
        App->>Cal: post_step(it)
        activate Cal
        Cal->>GPU: copy iteration metadata
        deactivate Cal
    end

    App->>Cal: stop()
    activate Cal
    Cal->>File: save collected data (all ranks)
    deactivate Cal
sequenceDiagram
    actor App as Application
    participant Cal as Calibrator
    participant File as File Storage
    participant MOE as MOE Module
    participant GPU as GPU Buffer

    App->>File: read calibration file
    App->>Cal: init(mode=REPLAY, file_path=..., layer_indices=...)
    activate Cal
    Cal->>Cal: load and validate replay data across ranks
    Cal->>GPU: allocate graph-compatible buffers
    deactivate Cal

    App->>Cal: start()
    activate Cal
    Cal->>Cal: initialize replay state
    deactivate Cal

    loop Per Iteration
        App->>Cal: pre_step(it)
        activate Cal
        Cal->>Cal: prepare replay data for iteration
        deactivate Cal
        App->>MOE: forward()
        activate MOE
        MOE->>Cal: maybe_collect_or_replay_slots()
        activate Cal
        Cal->>GPU: load and apply replay slot data
        Cal-->>MOE: replayed token_selected_slots
        deactivate Cal
        MOE-->>App: result
        deactivate MOE
        App->>Cal: post_step(it)
        activate Cal
        Cal->>Cal: record actual metadata for verification
        deactivate Cal
    end

    App->>Cal: stop()
    activate Cal
    Cal->>Cal: verify actual vs. recorded metadata
    deactivate Cal
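The COLLECT flow in the diagrams above can be illustrated with a minimal pure-Python mock. The method names follow the walkthrough; the buffer and file handling are deliberate simplifications of the real GPU-buffer implementation, not its API:

```python
class MockCalibrator:
    """Illustrative stand-in for the Calibrator lifecycle described above."""

    def __init__(self, mode):
        assert mode in ("NONE", "MARK", "COLLECT", "REPLAY")
        self.mode = mode
        self.collected = {}   # iteration -> recorded slot lists (stands in for GPU buffers)
        self._it = None

    def start(self):
        # Reset state before a profiling run.
        self.collected.clear()
        self._it = None

    def pre_step(self, it):
        self._it = it
        if self.mode == "COLLECT":
            self.collected[it] = []

    def maybe_collect_or_replay_slots(self, num_slots, token_selected_slots):
        # Called from MOE routing on token_selected_slots.
        if self.mode == "COLLECT":
            self.collected[self._it].append(list(token_selected_slots))
        elif self.mode == "REPLAY":
            return self.collected[self._it].pop(0)
        return token_selected_slots

    def post_step(self, it):
        assert it == self._it  # per-iteration metadata bookkeeping

    def stop(self):
        return self.collected  # the real implementation saves per-rank data to a file
```

Driving it with the per-iteration loop from the first diagram records one slot list per MOE call per iteration, which REPLAY mode would later feed back.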

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 38.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ⚠️ Warning: The PR description is incomplete and consists only of the repository template with placeholder sections and no custom content. Resolution: fill in at least the Description and Test Coverage sections with relevant details about the performance alignment feature being added.

✅ Passed checks (1 passed)
  • Title check: The title clearly summarizes the main change (adding performance alignment functionality to layer-wise benchmarks), which aligns with the extensive changes shown in the raw summary.



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 20

In `@examples/layer_wise_benchmarks/correlation_template.html`:
- Line 47: The current lookup for referenceData uses
s.series.search('reference') == 0 which is unclear and brittle; update the
predicate to use s.series.startsWith('reference') and guard against undefined
series (e.g., s.series && s.series.startsWith('reference')) so rawData.find(...)
reliably finds entries whose series begins with "reference"; update the
referenceData declaration accordingly.

In `@examples/layer_wise_benchmarks/correlation.py`:
- Around line 1-6: Add the missing NVIDIA copyright header to the top of
correlation.py: insert the project's standard multi-line NVIDIA copyright notice
with the year of latest meaningful modification as used across the repo
(matching other source files), placed above all imports and module code (before
the existing imports like kernel_short_name and shortest_common_supersequence)
so the file now includes the required header.

In `@examples/layer_wise_benchmarks/middleware/mpi_env_from_ompi`:
- Around line 3-8: The script uses set -u so missing OMPI_* vars produce an
unhelpful "unbound variable" error; add explicit validation before exporting
WORLD_SIZE, RANK, LOCAL_RANK, and NODE_RANK by checking OMPI_COMM_WORLD_SIZE,
OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, and OMPI_COMM_WORLD_NODE_RANK
and printing clear, actionable error messages and exiting non‑zero if any are
unset or empty (or alternatively provide explicit defaults if intended); update
the export block (the four export lines) to run only after these checks so
failures are deterministic and readable when not run under Open MPI.

In `@examples/layer_wise_benchmarks/parse_e2e.py`:
- Around line 25-27: The CLI argument "--target-gen-reqs" is parsed as a string
which causes mismatches when compared to integer values (leading to empty
eager_iters); update the call to parser.add_argument("--target-gen-reqs") to
parse integers by adding type=int so comparisons with parsed ints succeed—modify
the parser.add_argument call for "--target-gen-reqs" (and any related usage that
assumes an int) to use type=int.
- Around line 1-7: Add the standard NVIDIA TensorRT-LLM copyright header
(updated year 2026) at the very top of the file parse_e2e.py, placing it before
any imports; ensure the header matches the project's canonical NVIDIA header
format and includes the copyright notice, license text and modification year
2026.
- Around line 191-195: The range check uses the leaked loop variable
eager_layers instead of the intended list for iteration 0; update the
comparisons in the block that builds eager_per_layer_kernels to reference
per_layer_eager_layers[0] explicitly (e.g., replace
eager_layers[eager_layers_idx][1] with
per_layer_eager_layers[0][eager_layers_idx][1]) so eager_layers_idx and
eager_kernel are validated against the correct layer list; keep existing bisect
on per_layer_eager_layers[0] and ensure all accesses use that same explicit
list.
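The underlying pitfall is that Python for-loop variables leak into the enclosing scope, so a later block can silently read the last value of an earlier loop (the variable names below mirror the comment; the data is illustrative):

```python
# Hypothetical data shaped like per_layer_eager_layers in parse_e2e.py.
per_layer_eager_layers = [[(0, 10), (10, 20)], [(0, 5)]]

for eager_layers in per_layer_eager_layers:
    pass  # after the loop, eager_layers still holds the LAST element

# Code that later indexes `eager_layers[...]` runs without error but reads
# per_layer_eager_layers[-1], not the intended per_layer_eager_layers[0].
assert eager_layers == [(0, 5)]
assert per_layer_eager_layers[0] != eager_layers
```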
- Around line 206-212: The code uses bisect.bisect(..., key=...) which is
unsupported before Python 3.10; replace that call by precomputing a list of
start indices from super_per_layer_kernels (e.g., starts = [t[0] for t in
super_per_layer_kernels]) and then use bisect.bisect(starts, j) - 1 to compute
layer_idx; update the logic that references
super_per_layer_kernels[layer_idx][1] and appends to
graph_per_layer_kernels[layer_idx] accordingly so behavior remains identical.
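bisect's key= parameter only exists on Python 3.10+; precomputing the sort keys gives identical behavior on any version (the pair layout below is an assumed shape for super_per_layer_kernels):

```python
import bisect

# Assumed shape: (start_index, payload) pairs sorted by start index.
super_per_layer_kernels = [(0, "layer5"), (7, "layer6"), (15, "layer7")]

# Precompute the keys instead of bisect.bisect(..., key=lambda t: t[0]).
starts = [t[0] for t in super_per_layer_kernels]

def layer_for(j):
    # Index of the layer whose start is the rightmost one <= j.
    return bisect.bisect(starts, j) - 1

assert layer_for(0) == 0
assert layer_for(7) == 1
assert layer_for(14) == 1
assert layer_for(15) == 2
```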

In `@examples/layer_wise_benchmarks/parser_utils.py`:
- Around line 1-6: Add the standard NVIDIA copyright/header to the top of this
new module (parser_utils.py) with the current modification year 2026; place the
header comment block before the first import and ensure it matches the project's
standard TensorRT-LLM header format (including copyright notice, license
reference, and any required SPDX or contribution lines) so that the file begins
with the exact required NVIDIA header followed by the existing imports (re,
subprocess, sys, numpy).

In `@examples/layer_wise_benchmarks/README.md`:
- Around line 214-253: Update the MARK-mode profile filename and fix the
numbered list ordering in the README: when instructing to run Step 1 again in
MARK mode, change the recommended nsys output argument from "-o
profiles/report_e2e_collect_rank%q{RANK}.nsys-rep" to a MARK-specific name such
as "-o profiles/report_mark_rank%q{RANK}.nsys-rep" to avoid confusion, and
renumber the “Here are explanations of every argument” list sequentially (1
through 8) so entries for NP, --load-format, --layer-indices, --batch-size,
--seq-len-q, --seq-len-kv-cache, --replay-file-path, and
--replay-start/--replay-stop appear in logical order and match the CLI example;
references to config fields (cuda_graph_config and
layer_wise_benchmarks_config.calibration_mode) remain unchanged.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Line 42: The calibrator is being invoked unconditionally in configurable_moe
(call to get_calibrator) even when routing is fused and token_selected_slots is
None; change the top-level import to keep the module namespace (import
tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks) and then
guard the calibrator call so it only runs when token_selected_slots is not None
(e.g., if token_selected_slots is not None: calibrator =
layer_wise_benchmarks.get_calibrator(...)); apply the same guard to the other
nearby calls around the existing get_calibrator usage (the block spanning the
current lines ~627–632).

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Line 27: Update LayerwiseBenchmarksConfig to include a new field
replay_verify_metadata: Optional[bool] so the config can carry the flag; change
the import in py_executor_creator.py to preserve module namespace by using
import tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks
(instead of importing get_calibrator directly); then locate the call to
calibrator.init(...) in py_executor_creator.py and add the
replay_verify_metadata argument (pass the value from LayerwiseBenchmarksConfig)
so REPLAY mode no longer raises ValueError("missing replay_verify_metadata").

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 38-40: Replace the direct submodule import so the module namespace
is preserved: change the import statement in py_executor.py from "from
tensorrt_llm.tools.layer_wise_benchmarks import get_calibrator" to "from
tensorrt_llm.tools import layer_wise_benchmarks", then update all calls to
get_calibrator (e.g., the invocation around line 740) to use
layer_wise_benchmarks.get_calibrator(); apply the same import-and-call pattern
to other files that currently import get_calibrator directly to comply with the
coding guideline.

In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 847-871: The Literal type for calibration_mode is missing
"REPLAY", causing validate_calibration_file_path (in LayerwiseBenchmarksConfig)
to reject REPLAY; update the calibration_mode field definition to include
"REPLAY" as an allowed Literal value (so calibration_mode can be set to "NONE",
"MARK", "COLLECT", or "REPLAY") and ensure the model_validator method
validate_calibration_file_path continues to check self.calibration_mode for
["COLLECT", "REPLAY"] as before.
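A minimal sketch of the corrected field using typing.Literal; this is a plain-Python stand-in for the pydantic model in llm_args.py, and the file path is illustrative:

```python
from typing import Literal, Optional

# All four modes, including the previously missing "REPLAY".
CalibrationMode = Literal["NONE", "MARK", "COLLECT", "REPLAY"]

def validate_calibration_file_path(calibration_mode: str,
                                   calibration_file_path: Optional[str]) -> None:
    # Mirrors the model_validator: COLLECT and REPLAY require a file path.
    if calibration_mode in ("COLLECT", "REPLAY") and not calibration_file_path:
        raise ValueError(
            f"calibration_file_path must be set in {calibration_mode} mode")

validate_calibration_file_path("REPLAY", "/tmp/calib.bin")  # accepted
try:
    validate_calibration_file_path("REPLAY", None)
    raised = False
except ValueError:
    raised = True
assert raised
```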

In `@tensorrt_llm/tools/layer_wise_benchmarks/calibrator.py`:
- Around line 1-11: This new source file (calibrator.py) is missing the required
NVIDIA TensorRT-LLM copyright header; add the standard NVIDIA header block at
the top of the file with the latest modification year 2026, ensuring it matches
the project's header style and includes the copyright notice, license statement,
and any required contributor/ownership lines before the existing imports
(base64, functools, json, zlib, etc.) so the file complies with repository
coding guidelines.
- Around line 101-113: Before setting self.mode and calling _init_collect_mode
or _init_replay_mode, validate required inputs and raise clear ValueError
messages: ensure dist is provided when mode is COLLECT, ensure mapping and
layer_indices (and replay_verify_metadata) are provided when mode is REPLAY.
Update the block around Mode[mode] assignment to check mapping, dist, and
layer_indices early (referencing Mode, self.mode, mapping, dist, layer_indices,
replay_verify_metadata) and raise explicit errors before calling
_init_collect_mode or _init_replay_mode so missing inputs fail fast with clear
messages.
- Around line 672-698: The method get_replay_iteration_range currently computes
start_iter and stop_iter as the first and last element of sorted self._replay_db
keys but returns stop_iter inclusive while the docstring promises an exclusive
upper bound; change the return to return start_iter, stop_iter + 1 so callers
receive [start_iter, stop_iter) as documented, and keep the contiguous-range
verification using local_iterations != list(range(start_iter, stop_iter + 1))
unchanged (it still validates contiguity against the inclusive last iter).
Ensure the docstring remains the same and update any callers only if they relied
on the inclusive behavior.
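The half-open convention can be sketched as follows; this is a simplified stand-in for get_replay_iteration_range operating on a plain dict in place of _replay_db:

```python
def get_replay_iteration_range(replay_db):
    """Return [start_iter, stop_iter) covering the recorded iterations."""
    iterations = sorted(replay_db)
    start_iter, last_iter = iterations[0], iterations[-1]
    # Contiguity is still verified against the inclusive last iteration.
    if iterations != list(range(start_iter, last_iter + 1)):
        raise ValueError("replay iterations are not contiguous")
    return start_iter, last_iter + 1  # exclusive upper bound, as documented

start, stop = get_replay_iteration_range({3: "a", 4: "b", 5: "c"})
assert (start, stop) == (3, 6)
assert list(range(start, stop)) == [3, 4, 5]
```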
- Around line 540-569: post_step() can overrun the fixed-size buffers
_collected_metadata_idx, _collected_slots_cpu, and
_collected_actual_metadata_idx because record_idx is computed from dynamic list
lengths without bounds checks; add explicit index bounds checks before accessing
these arrays in both Mode.COLLECT and Mode.REPLAY branches (i.e., before using
record_idx with _collected_metadata_idx.copy_, _collected_slots_cpu[record_idx],
and _collected_actual_metadata_idx.copy_), and raise a clear RuntimeError (or
ValueError) that includes the offending record_idx, the buffer name, and the
buffer capacity when record_idx >= len(buffer); keep behavior otherwise
unchanged.
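The suggested guard pattern, sketched against a plain list standing in for the fixed-size GPU buffer (the buffer name is taken from the comment above):

```python
def checked_write(buffer, record_idx, value, name):
    # Fail fast with a descriptive error instead of silently overrunning.
    if record_idx >= len(buffer):
        raise RuntimeError(
            f"record_idx {record_idx} exceeds capacity {len(buffer)} "
            f"of buffer {name!r}")
    buffer[record_idx] = value

buf = [None] * 4  # stands in for _collected_slots_cpu
checked_write(buf, 2, "slots", "_collected_slots_cpu")
assert buf[2] == "slots"
try:
    checked_write(buf, 4, "slots", "_collected_slots_cpu")
    overran = False
except RuntimeError:
    overran = True
assert overran
```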
- Around line 79-88: The init method uses Python 3.9+ style annotation
`list[int]` which breaks on 3.8 and also doesn't allow None as noted in the
docstring; update the signature of initializer `init` to use typing.Optional and
typing.List (e.g., layer_indices: Optional[List[int]]), ensure Optional and List
are imported from typing, and if relevant set the default for layer_indices to
None or handle None in the method body (since mode "COLLECT" may pass None).
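The Python-3.8-compatible signature looks like this; on 3.8 a bare `list[int]` annotation raises TypeError at function definition time. The body is an assumed simplification showing the None handling:

```python
from typing import List, Optional

def init(mode: str, layer_indices: Optional[List[int]] = None) -> List[int]:
    # COLLECT mode may legitimately pass None; normalize to an empty list.
    return [] if layer_indices is None else layer_indices

assert init("COLLECT") == []
assert init("REPLAY", [5, 6, 7]) == [5, 6, 7]
```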

In `@tensorrt_llm/tools/layer_wise_benchmarks/runner.py`:
- Around line 446-448: The code accesses model.model.layers[layer_indices[0]] to
set residual_fusion without ensuring layer_indices is non-empty, which can raise
IndexError; modify the logic around the residual_fusion assignment (before the
loop that iterates over layer_indices) to first check if layer_indices is
truthy/has length > 0 and only then access layer_indices[0], otherwise set
residual_fusion to a safe default (e.g., False) or handle the empty case
appropriately so the subsequent loop over layer_indices is safe; update any
downstream assumptions in the loop that use residual_fusion or expect at least
one layer.
- Around line 444-460: The constructor sets up a local layer_indices used by the
nested forward() but never assigns it to the instance, causing
replace_routing_method_ctx to fail when accessing self.layer_indices; fix by
assigning the incoming layer_indices parameter to self.layer_indices in the same
initializer (e.g., in __init__ assign self.layer_indices = layer_indices) so
both the closure-based forward and the method replace_routing_method_ctx
reference the same instance attribute.
🧹 Nitpick comments (9)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

11-11: Prefer module import to preserve namespace.

This aligns with the Python import guideline and keeps the call site explicit.
As per coding guidelines, keep the module namespace on import.

♻️ Proposed refactor
-from tensorrt_llm.tools.layer_wise_benchmarks import get_calibrator
+import tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks
@@
-        token_selected_slots = get_calibrator().maybe_collect_or_replay_slots(
+        token_selected_slots = layer_wise_benchmarks.get_calibrator().maybe_collect_or_replay_slots(
             self.num_slots, token_selected_slots)

Also applies to: 443-444

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

11-11: Prefer module import to preserve namespace.

This follows the Python import guideline and keeps the call site explicit.
As per coding guidelines, keep the module namespace on import.

♻️ Proposed refactor
-from tensorrt_llm.tools.layer_wise_benchmarks import get_calibrator
+import tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks
@@
-        token_selected_slots = get_calibrator().maybe_collect_or_replay_slots(
+        token_selected_slots = layer_wise_benchmarks.get_calibrator().maybe_collect_or_replay_slots(
             self.num_slots, token_selected_slots)

Also applies to: 538-539

tensorrt_llm/llmapi/llm_args.py (1)

865-871: Ruff TRY003: avoid long inline exception message.

Ruff flags long messages inline in raise. Consider moving it to a constant/class var for readability.

♻️ Suggested tweak
 class LayerwiseBenchmarksConfig(StrictBaseModel):
     """
     Configuration for layer-wise benchmarks calibration.
     """
+    CALIBRATION_FILE_REQUIRED_MSG: ClassVar[str] = (
+        "calibration_file_path must be set when calibration_mode is COLLECT or REPLAY."
+    )

@@
     def validate_calibration_file_path(self) -> 'LayerwiseBenchmarksConfig':
         if self.calibration_mode in ["COLLECT", "REPLAY"
                                      ] and not self.calibration_file_path:
-            raise ValueError(
-                f"Expect calibration_file_path not to be empty when work on {self.calibration_mode} mode"
-            )
+            raise ValueError(self.CALIBRATION_FILE_REQUIRED_MSG)
         return self
examples/layer_wise_benchmarks/parse.py (2)

353-354: Enable Jinja2 autoescape for XSS mitigation.

While this generates local HTML files, enabling autoescape is a security best practice. The static analysis tool flagged this (S701).

Proposed fix
 loader = jinja2.FileSystemLoader(Path(__file__).parent)
-template = jinja2.Environment(loader=loader).get_template("breakdown_template.html")
+template = jinja2.Environment(loader=loader, autoescape=True).get_template("breakdown_template.html")

378-399: Consider adding strict=True to zip() for safety.

The static analysis tool flagged zip() without strict= (B905). While problem_set and kernels should have the same length by construction, adding strict=True provides a runtime check that catches mismatches early.

Proposed fix
 correlation = []
-for problem, runs in zip(problem_set, kernels):
+for problem, runs in zip(problem_set, kernels, strict=True):
     timeline = []
examples/layer_wise_benchmarks/sample_performance_alignment.sh (1)

121-130: Consider using parallel execution for independent parsing tasks.

The xargs -I% runs sequentially. For improved performance with multiple ranks, consider adding -P for parallel execution:

Proposed optimization
-seq 0 $((NP - 1)) | xargs -I% python3 parse_e2e.py \
+seq 0 $((NP - 1)) | xargs -P$NP -I% python3 parse_e2e.py \
     --eager-trace "$PROFILE_DIR/report_e2e_mark_rank%.nsys-rep" \
     --graph-trace "$PROFILE_DIR/report_e2e_collect_rank%.nsys-rep" \
     --layer-indices 5,6,7 \
     --warmup-times 5 \
     -o "$PROFILE_DIR/report_e2e_collect_rank%.json"
-seq 0 $((NP - 1)) | xargs -I% python3 parse.py \
+seq 0 $((NP - 1)) | xargs -P$NP -I% python3 parse.py \
     --profile-dir "$PROFILE_DIR" \
     --world-size $NP \
     --rank %
examples/layer_wise_benchmarks/correlation.py (2)

93-96: Enable Jinja2 autoescape for XSS mitigation.

The static analysis tool flagged this (S701). While generating local HTML files, enabling autoescape is a security best practice.

Proposed fix
 loader = jinja2.FileSystemLoader(Path(__file__).parent)
-template = jinja2.Environment(loader=loader).get_template("correlation_template.html")
+template = jinja2.Environment(loader=loader, autoescape=True).get_template("correlation_template.html")
 with open(args.output, "w") as f:
     f.write(template.render(rawData=data))

86-88: Consider adding strict=True to zip() for safety.

The static analysis tool flagged zip() without strict= (B905). Adding strict=True provides a runtime check that x_tgt and tgt_data["timeline"] have matching lengths.

Proposed fix
                     "duration": o["duration"] / 1000,
                     "end": o["end"] / 1000,
                 }
-                for x, o in zip(x_tgt, tgt_data["timeline"])
+                for x, o in zip(x_tgt, tgt_data["timeline"], strict=True)
             ],
examples/layer_wise_benchmarks/parse_e2e.py (1)

10-15: Keep parser_utils namespace in imports.

The guidelines require preserving the module namespace instead of importing symbols directly. Please switch to a module import and update call sites accordingly. As per coding guidelines, keep module namespaces in imports.

♻️ Suggested change
-from parser_utils import (
-    kernel_short_name,
-    lazy_convert_sqlite,
-    shortest_common_supersequence,
-    warned_names,
-)
+import parser_utils

Then update usages to parser_utils.kernel_short_name, parser_utils.lazy_convert_sqlite, parser_utils.shortest_common_supersequence, and parser_utils.warned_names.

yuantailing force-pushed the layer_wise_benchmarks branch from 3834e50 to bfceb88 on January 27, 2026 06:02
@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33687 [ run ] triggered by Bot. Commit: bfceb88

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33689 [ run ] triggered by Bot. Commit: 00a8743

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33727 [ run ] triggered by Bot. Commit: 00a8743

yuantailing requested a review from a team as a code owner on January 27, 2026 12:26
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
yuantailing force-pushed the layer_wise_benchmarks branch from de6bf6f to bd13002 on January 27, 2026 12:33
@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33731 [ run ] triggered by Bot. Commit: bd13002

@tensorrt-cicd (Collaborator)

PR_Github #33731 [ run ] completed with state SUCCESS. Commit: bd13002
/LLM/main/L0_MergeRequest_PR pipeline #26016 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuantailing (Member Author)

/bot run --disable-fail-fast

@Superjomn (Collaborator) left a comment

LGTM on the llmapi changes.

Superjomn requested a review from QiJune on January 28, 2026 02:33
@tensorrt-cicd (Collaborator)

PR_Github #33800 [ run ] triggered by Bot. Commit: bd13002

@tensorrt-cicd (Collaborator)

PR_Github #33800 [ run ] completed with state FAILURE. Commit: bd13002
/LLM/main/L0_MergeRequest_PR pipeline #26068 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33831 [ run ] triggered by Bot. Commit: bd13002

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33845 [ run ] triggered by Bot. Commit: bd13002

@tensorrt-cicd (Collaborator)

PR_Github #33845 [ run ] completed with state SUCCESS. Commit: bd13002
/LLM/main/L0_MergeRequest_PR pipeline #26100 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@QiJune (Collaborator) left a comment

LGTM

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
@yuantailing (Member Author)

/bot skip --comment "only docs changes"

@tensorrt-cicd (Collaborator)

PR_Github #33980 [ skip ] triggered by Bot. Commit: bd3203f

@tensorrt-cicd (Collaborator)

PR_Github #33980 [ skip ] completed with state SUCCESS. Commit: bd3203f
Skipping testing for commit bd3203f

yuantailing merged commit 9152836 into NVIDIA:main on Jan 29, 2026
5 checks passed

5 participants