
[None][feat] Add performance alignment to layer-wise benchmarks #11018

Merged
yuantailing merged 21 commits into NVIDIA:main from yuantailing:layer_wise_benchmarks on Jan 29, 2026

Conversation

@yuantailing (Member) commented Jan 27, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added layer-wise performance alignment workflow with COLLECT and MARK calibration modes for profiling and benchmarking optimization.
    • Introduced calibration system with NONE, MARK, COLLECT, and REPLAY modes for managing kernel execution tracking and data collection.
    • Added interactive HTML visualization dashboard for correlating and comparing kernel execution timelines across different profiling runs.
  • Documentation

    • Added performance alignment section with end-to-end workflow documentation and configuration examples.
  • Tests

    • Added new test for performance alignment workflow validation.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@coderabbitai bot (Contributor) commented Jan 27, 2026

📝 Walkthrough

This patch introduces a comprehensive layer-wise calibration and benchmarking system. It adds a Calibrator class with COLLECT/MARK/REPLAY modes, integrates it into MOE routing and PyExecutor, provides benchmarking utilities for trace parsing and kernel correlation, and delivers workflow scripts for end-to-end performance alignment analysis.

Changes

  • Calibrator Core
    Files: tensorrt_llm/tools/layer_wise_benchmarks/calibrator.py, tensorrt_llm/tools/layer_wise_benchmarks/__init__.py, tensorrt_llm/llmapi/llm_args.py
    Introduces the Calibrator class with four modes (NONE, MARK, COLLECT, REPLAY) managing the lifecycle (start/pre_step/post_step/stop), layer-wise NVTX markers, slot data collection/replay with GPU buffers, metadata verification, and per-rank synchronization. Adds LayerwiseBenchmarksConfig to the LLM args. Exports the get_calibrator() factory function.
  • MOE and Executor Integration
    Files: tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, tensorrt_llm/tools/layer_wise_benchmarks/runner.py
    Integrates the calibrator into MOE routing (calls maybe_collect_or_replay_slots on token_selected_slots), the PyExecutor profiling lifecycle (start/pre_step/post_step/stop callbacks), and Runner model setup (adds a public model attribute, replaces per-layer iteration with a single model call).
  • Benchmarking Utilities
    Files: examples/layer_wise_benchmarks/parser_utils.py, examples/layer_wise_benchmarks/correlation.py, examples/layer_wise_benchmarks/parse_e2e.py
    Introduces kernel utilities (lazy SQLite conversion, kernel short-name resolution, a shortest-common-supersequence algorithm with optional Numba JIT). Adds correlation.py for mapping target timelines to a reference via SCS and linear interpolation. Adds parse_e2e.py for extracting and aligning kernels across eager/graph NSYS traces.
  • Parse and Template Updates
    Files: examples/layer_wise_benchmarks/parse.py, examples/layer_wise_benchmarks/correlation_template.html
    Refactors parse.py to use parser_utils functions, switches to breakdown_template.html, generates correlation JSON timeline output, adds Memcpy/Memset mappings, and adjusts NVTX filtering and warmup handling. Introduces correlation_template.html with dual interactive ECharts (Duration, End Time) supporting dataZoom, tooltips, and responsive layout.
  • Workflow Scripts and Configuration
    Files: examples/layer_wise_benchmarks/run.py, examples/layer_wise_benchmarks/sample_performance_alignment.sh, examples/layer_wise_benchmarks/middleware/mpi_env_from_ompi, examples/layer_wise_benchmarks/slurm_alloc.sh, examples/layer_wise_benchmarks/slurm_init_containers.sh
    Extends run.py with --replay-file-path and related CLI args, calibrator orchestration, and prefill context environment patches. Introduces sample_performance_alignment.sh with a 5-step workflow (dataset prep, COLLECT phase, MARK phase, gen run, post-processing and correlation). Adds the mpi_env_from_ompi bridge script and updates the slurm scripts (job naming, SLURM-aware arch detection).
  • Documentation and Tests
    Files: examples/layer_wise_benchmarks/README.md, tests/unittest/tools/test_layer_wise_benchmarks.py, tests/integration/test_lists/test-db/l0_b200.yml
    Adds a "Performance alignment" section to the README (duplicated, +113 lines). Introduces test_performance_alignment parameterized over world_size [1, 4] running sample_performance_alignment.sh. Updates the test database to include the new test entry.
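The shortest common supersequence that drives the kernel correlation can be sketched with a plain dynamic-programming implementation. This is an illustrative stand-in under simplified assumptions, not the Numba-accelerated version shipped in parser_utils.py:

```python
def shortest_common_supersequence(a, b):
    """Return a length-minimal sequence containing both a and b as subsequences."""
    n, m = len(a), len(b)
    # dp[i][j] = SCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dp[i][j] = m - j          # only b remains
            elif j == m:
                dp[i][j] = n - i          # only a remains
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to reconstruct one optimal supersequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

In the correlation workflow the two inputs would be the kernel-name sequences of two runs; positions of each run inside the supersequence then give the alignment used for interpolation.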

Sequence Diagram(s)

sequenceDiagram
    actor App as Application
    participant Cal as Calibrator
    participant MOE as MOE Module
    participant GPU as GPU Buffer
    participant File as File Storage

    App->>Cal: init(mode=COLLECT, ...)
    activate Cal
    Cal->>GPU: allocate fixed buffers
    deactivate Cal

    App->>Cal: start()
    activate Cal
    Cal->>Cal: reset state
    deactivate Cal

    loop Per Iteration
        App->>Cal: pre_step(it)
        App->>MOE: forward()
        activate MOE
        MOE->>Cal: maybe_collect_or_replay_slots()
        activate Cal
        Cal->>GPU: record slot data
        deactivate Cal
        MOE-->>App: token_selected_slots
        deactivate MOE
        App->>Cal: post_step(it)
        activate Cal
        Cal->>GPU: copy iteration metadata
        deactivate Cal
    end

    App->>Cal: stop()
    activate Cal
    Cal->>File: save collected data (all ranks)
    deactivate Cal
sequenceDiagram
    actor App as Application
    participant Cal as Calibrator
    participant File as File Storage
    participant MOE as MOE Module
    participant GPU as GPU Buffer

    App->>File: read calibration file
    App->>Cal: init(mode=REPLAY, file_path=..., layer_indices=...)
    activate Cal
    Cal->>Cal: load and validate replay data across ranks
    Cal->>GPU: allocate graph-compatible buffers
    deactivate Cal

    App->>Cal: start()
    activate Cal
    Cal->>Cal: initialize replay state
    deactivate Cal

    loop Per Iteration
        App->>Cal: pre_step(it)
        activate Cal
        Cal->>Cal: prepare replay data for iteration
        deactivate Cal
        App->>MOE: forward()
        activate MOE
        MOE->>Cal: maybe_collect_or_replay_slots()
        activate Cal
        Cal->>GPU: load and apply replay slot data
        Cal-->>MOE: replayed token_selected_slots
        deactivate Cal
        MOE-->>App: result
        deactivate MOE
        App->>Cal: post_step(it)
        activate Cal
        Cal->>Cal: record actual metadata for verification
        deactivate Cal
    end

    App->>Cal: stop()
    activate Cal
    Cal->>Cal: verify actual vs. recorded metadata
    deactivate Cal
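The COLLECT flow in the diagrams above can be illustrated with a minimal pure-Python mock. The method names follow the walkthrough; the buffer and file handling are deliberate simplifications of the real GPU-buffer implementation, not its API:

```python
class MockCalibrator:
    """Illustrative stand-in for the Calibrator lifecycle described above."""

    def __init__(self, mode):
        assert mode in ("NONE", "MARK", "COLLECT", "REPLAY")
        self.mode = mode
        self.collected = {}   # iteration -> recorded slot lists (stands in for GPU buffers)
        self._it = None

    def start(self):
        # Reset state before a profiling run.
        self.collected.clear()
        self._it = None

    def pre_step(self, it):
        self._it = it
        if self.mode == "COLLECT":
            self.collected[it] = []

    def maybe_collect_or_replay_slots(self, num_slots, token_selected_slots):
        # Called from MOE routing on token_selected_slots.
        if self.mode == "COLLECT":
            self.collected[self._it].append(list(token_selected_slots))
        elif self.mode == "REPLAY":
            return self.collected[self._it].pop(0)
        return token_selected_slots

    def post_step(self, it):
        assert it == self._it  # per-iteration metadata bookkeeping

    def stop(self):
        return self.collected  # the real implementation saves per-rank data to a file
```

Driving it with the per-iteration loop from the first diagram records one slot list per MOE call per iteration, which REPLAY mode would later feed back.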

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 38.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ⚠️ Warning: The PR description is incomplete and consists only of the repository template with placeholder sections and no custom content. Resolution: fill in at least the Description and Test Coverage sections with relevant details about the performance alignment feature being added.

✅ Passed checks (1 passed)
  • Title check: The title clearly summarizes the main change (adding performance alignment functionality to layer-wise benchmarks), which aligns with the extensive changes shown in the raw summary.



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 20

In `@examples/layer_wise_benchmarks/correlation_template.html`:
- Line 47: The current lookup for referenceData uses
s.series.search('reference') == 0 which is unclear and brittle; update the
predicate to use s.series.startsWith('reference') and guard against undefined
series (e.g., s.series && s.series.startsWith('reference')) so rawData.find(...)
reliably finds entries whose series begins with "reference"; update the
referenceData declaration accordingly.

In `@examples/layer_wise_benchmarks/correlation.py`:
- Around line 1-6: Add the missing NVIDIA copyright header to the top of
correlation.py: insert the project's standard multi-line NVIDIA copyright notice
with the year of latest meaningful modification as used across the repo
(matching other source files), placed above all imports and module code (before
the existing imports like kernel_short_name and shortest_common_supersequence)
so the file now includes the required header.

In `@examples/layer_wise_benchmarks/middleware/mpi_env_from_ompi`:
- Around line 3-8: The script uses set -u so missing OMPI_* vars produce an
unhelpful "unbound variable" error; add explicit validation before exporting
WORLD_SIZE, RANK, LOCAL_RANK, and NODE_RANK by checking OMPI_COMM_WORLD_SIZE,
OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, and OMPI_COMM_WORLD_NODE_RANK
and printing clear, actionable error messages and exiting non‑zero if any are
unset or empty (or alternatively provide explicit defaults if intended); update
the export block (the four export lines) to run only after these checks so
failures are deterministic and readable when not run under Open MPI.

In `@examples/layer_wise_benchmarks/parse_e2e.py`:
- Around line 25-27: The CLI argument "--target-gen-reqs" is parsed as a string
which causes mismatches when compared to integer values (leading to empty
eager_iters); update the call to parser.add_argument("--target-gen-reqs") to
parse integers by adding type=int so comparisons with parsed ints succeed—modify
the parser.add_argument call for "--target-gen-reqs" (and any related usage that
assumes an int) to use type=int.
- Around line 1-7: Add the standard NVIDIA TensorRT-LLM copyright header
(updated year 2026) at the very top of the file parse_e2e.py, placing it before
any imports; ensure the header matches the project's canonical NVIDIA header
format and includes the copyright notice, license text and modification year
2026.
- Around line 191-195: The range check uses the leaked loop variable
eager_layers instead of the intended list for iteration 0; update the
comparisons in the block that builds eager_per_layer_kernels to reference
per_layer_eager_layers[0] explicitly (e.g., replace
eager_layers[eager_layers_idx][1] with
per_layer_eager_layers[0][eager_layers_idx][1]) so eager_layers_idx and
eager_kernel are validated against the correct layer list; keep existing bisect
on per_layer_eager_layers[0] and ensure all accesses use that same explicit
list.
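The underlying pitfall is that Python for-loop variables leak into the enclosing scope, so a later block can silently read the last value of an earlier loop (the variable names below mirror the comment; the data is illustrative):

```python
# Hypothetical data shaped like per_layer_eager_layers in parse_e2e.py.
per_layer_eager_layers = [[(0, 10), (10, 20)], [(0, 5)]]

for eager_layers in per_layer_eager_layers:
    pass  # after the loop, eager_layers still holds the LAST element

# Code that later indexes `eager_layers[...]` runs without error but reads
# per_layer_eager_layers[-1], not the intended per_layer_eager_layers[0].
assert eager_layers == [(0, 5)]
assert per_layer_eager_layers[0] != eager_layers
```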
- Around line 206-212: The code uses bisect.bisect(..., key=...) which is
unsupported before Python 3.10; replace that call by precomputing a list of
start indices from super_per_layer_kernels (e.g., starts = [t[0] for t in
super_per_layer_kernels]) and then use bisect.bisect(starts, j) - 1 to compute
layer_idx; update the logic that references
super_per_layer_kernels[layer_idx][1] and appends to
graph_per_layer_kernels[layer_idx] accordingly so behavior remains identical.
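bisect's key= parameter only exists on Python 3.10+; precomputing the sort keys gives identical behavior on any version (the pair layout below is an assumed shape for super_per_layer_kernels):

```python
import bisect

# Assumed shape: (start_index, payload) pairs sorted by start index.
super_per_layer_kernels = [(0, "layer5"), (7, "layer6"), (15, "layer7")]

# Precompute the keys instead of bisect.bisect(..., key=lambda t: t[0]).
starts = [t[0] for t in super_per_layer_kernels]

def layer_for(j):
    # Index of the layer whose start is the rightmost one <= j.
    return bisect.bisect(starts, j) - 1

assert layer_for(0) == 0
assert layer_for(7) == 1
assert layer_for(14) == 1
assert layer_for(15) == 2
```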

In `@examples/layer_wise_benchmarks/parser_utils.py`:
- Around line 1-6: Add the standard NVIDIA copyright/header to the top of this
new module (parser_utils.py) with the current modification year 2026; place the
header comment block before the first import and ensure it matches the project's
standard TensorRT-LLM header format (including copyright notice, license
reference, and any required SPDX or contribution lines) so that the file begins
with the exact required NVIDIA header followed by the existing imports (re,
subprocess, sys, numpy).

In `@examples/layer_wise_benchmarks/README.md`:
- Around line 214-253: Update the MARK-mode profile filename and fix the
numbered list ordering in the README: when instructing to run Step 1 again in
MARK mode, change the recommended nsys output argument from "-o
profiles/report_e2e_collect_rank%q{RANK}.nsys-rep" to a MARK-specific name such
as "-o profiles/report_mark_rank%q{RANK}.nsys-rep" to avoid confusion, and
renumber the “Here are explanations of every argument” list sequentially (1
through 8) so entries for NP, --load-format, --layer-indices, --batch-size,
--seq-len-q, --seq-len-kv-cache, --replay-file-path, and
--replay-start/--replay-stop appear in logical order and match the CLI example;
references to config fields (cuda_graph_config and
layer_wise_benchmarks_config.calibration_mode) remain unchanged.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Line 42: The calibrator is being invoked unconditionally in configurable_moe
(call to get_calibrator) even when routing is fused and token_selected_slots is
None; change the top-level import to keep the module namespace (import
tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks) and then
guard the calibrator call so it only runs when token_selected_slots is not None
(e.g., if token_selected_slots is not None: calibrator =
layer_wise_benchmarks.get_calibrator(...)); apply the same guard to the other
nearby calls around the existing get_calibrator usage (the block spanning the
current lines ~627–632).

In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Line 27: Update LayerwiseBenchmarksConfig to include a new field
replay_verify_metadata: Optional[bool] so the config can carry the flag; change
the import in py_executor_creator.py to preserve module namespace by using
import tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks
(instead of importing get_calibrator directly); then locate the call to
calibrator.init(...) in py_executor_creator.py and add the
replay_verify_metadata argument (pass the value from LayerwiseBenchmarksConfig)
so REPLAY mode no longer raises ValueError("missing replay_verify_metadata").

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 38-40: Replace the direct submodule import so the module namespace
is preserved: change the import statement in py_executor.py from "from
tensorrt_llm.tools.layer_wise_benchmarks import get_calibrator" to "from
tensorrt_llm.tools import layer_wise_benchmarks", then update all calls to
get_calibrator (e.g., the invocation around line 740) to use
layer_wise_benchmarks.get_calibrator(); apply the same import-and-call pattern
to other files that currently import get_calibrator directly to comply with the
coding guideline.

In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 847-871: The Literal type for calibration_mode is missing
"REPLAY", causing validate_calibration_file_path (in LayerwiseBenchmarksConfig)
to reject REPLAY; update the calibration_mode field definition to include
"REPLAY" as an allowed Literal value (so calibration_mode can be set to "NONE",
"MARK", "COLLECT", or "REPLAY") and ensure the model_validator method
validate_calibration_file_path continues to check self.calibration_mode for
["COLLECT", "REPLAY"] as before.
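A minimal sketch of the corrected field using typing.Literal; this is a plain-Python stand-in for the pydantic model in llm_args.py, and the file path is illustrative:

```python
from typing import Literal, Optional

# All four modes, including the previously missing "REPLAY".
CalibrationMode = Literal["NONE", "MARK", "COLLECT", "REPLAY"]

def validate_calibration_file_path(calibration_mode: str,
                                   calibration_file_path: Optional[str]) -> None:
    # Mirrors the model_validator: COLLECT and REPLAY require a file path.
    if calibration_mode in ("COLLECT", "REPLAY") and not calibration_file_path:
        raise ValueError(
            f"calibration_file_path must be set in {calibration_mode} mode")

validate_calibration_file_path("REPLAY", "/tmp/calib.bin")  # accepted
try:
    validate_calibration_file_path("REPLAY", None)
    raised = False
except ValueError:
    raised = True
assert raised
```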

In `@tensorrt_llm/tools/layer_wise_benchmarks/calibrator.py`:
- Around line 1-11: This new source file (calibrator.py) is missing the required
NVIDIA TensorRT-LLM copyright header; add the standard NVIDIA header block at
the top of the file with the latest modification year 2026, ensuring it matches
the project's header style and includes the copyright notice, license statement,
and any required contributor/ownership lines before the existing imports
(base64, functools, json, zlib, etc.) so the file complies with repository
coding guidelines.
- Around line 101-113: Before setting self.mode and calling _init_collect_mode
or _init_replay_mode, validate required inputs and raise clear ValueError
messages: ensure dist is provided when mode is COLLECT, ensure mapping and
layer_indices (and replay_verify_metadata) are provided when mode is REPLAY.
Update the block around Mode[mode] assignment to check mapping, dist, and
layer_indices early (referencing Mode, self.mode, mapping, dist, layer_indices,
replay_verify_metadata) and raise explicit errors before calling
_init_collect_mode or _init_replay_mode so missing inputs fail fast with clear
messages.
- Around line 672-698: The method get_replay_iteration_range currently computes
start_iter and stop_iter as the first and last element of sorted self._replay_db
keys but returns stop_iter inclusive while the docstring promises an exclusive
upper bound; change the return to return start_iter, stop_iter + 1 so callers
receive [start_iter, stop_iter) as documented, and keep the contiguous-range
verification using local_iterations != list(range(start_iter, stop_iter + 1))
unchanged (it still validates contiguity against the inclusive last iter).
Ensure the docstring remains the same and update any callers only if they relied
on the inclusive behavior.
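The half-open convention can be sketched as follows; this is a simplified stand-in for get_replay_iteration_range operating on a plain dict in place of _replay_db:

```python
def get_replay_iteration_range(replay_db):
    """Return [start_iter, stop_iter) covering the recorded iterations."""
    iterations = sorted(replay_db)
    start_iter, last_iter = iterations[0], iterations[-1]
    # Contiguity is still verified against the inclusive last iteration.
    if iterations != list(range(start_iter, last_iter + 1)):
        raise ValueError("replay iterations are not contiguous")
    return start_iter, last_iter + 1  # exclusive upper bound, as documented

start, stop = get_replay_iteration_range({3: "a", 4: "b", 5: "c"})
assert (start, stop) == (3, 6)
assert list(range(start, stop)) == [3, 4, 5]
```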
- Around line 540-569: post_step() can overrun the fixed-size buffers
_collected_metadata_idx, _collected_slots_cpu, and
_collected_actual_metadata_idx because record_idx is computed from dynamic list
lengths without bounds checks; add explicit index bounds checks before accessing
these arrays in both Mode.COLLECT and Mode.REPLAY branches (i.e., before using
record_idx with _collected_metadata_idx.copy_, _collected_slots_cpu[record_idx],
and _collected_actual_metadata_idx.copy_), and raise a clear RuntimeError (or
ValueError) that includes the offending record_idx, the buffer name, and the
buffer capacity when record_idx >= len(buffer); keep behavior otherwise
unchanged.
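The suggested guard pattern, sketched against a plain list standing in for the fixed-size GPU buffer (the buffer name is taken from the comment above):

```python
def checked_write(buffer, record_idx, value, name):
    # Fail fast with a descriptive error instead of silently overrunning.
    if record_idx >= len(buffer):
        raise RuntimeError(
            f"record_idx {record_idx} exceeds capacity {len(buffer)} "
            f"of buffer {name!r}")
    buffer[record_idx] = value

buf = [None] * 4  # stands in for _collected_slots_cpu
checked_write(buf, 2, "slots", "_collected_slots_cpu")
assert buf[2] == "slots"
try:
    checked_write(buf, 4, "slots", "_collected_slots_cpu")
    overran = False
except RuntimeError:
    overran = True
assert overran
```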
- Around line 79-88: The init method uses Python 3.9+ style annotation
`list[int]` which breaks on 3.8 and also doesn't allow None as noted in the
docstring; update the signature of initializer `init` to use typing.Optional and
typing.List (e.g., layer_indices: Optional[List[int]]), ensure Optional and List
are imported from typing, and if relevant set the default for layer_indices to
None or handle None in the method body (since mode "COLLECT" may pass None).
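The Python-3.8-compatible signature looks like this; on 3.8 a bare `list[int]` annotation raises TypeError at function definition time. The body is an assumed simplification showing the None handling:

```python
from typing import List, Optional

def init(mode: str, layer_indices: Optional[List[int]] = None) -> List[int]:
    # COLLECT mode may legitimately pass None; normalize to an empty list.
    return [] if layer_indices is None else layer_indices

assert init("COLLECT") == []
assert init("REPLAY", [5, 6, 7]) == [5, 6, 7]
```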

In `@tensorrt_llm/tools/layer_wise_benchmarks/runner.py`:
- Around line 446-448: The code accesses model.model.layers[layer_indices[0]] to
set residual_fusion without ensuring layer_indices is non-empty, which can raise
IndexError; modify the logic around the residual_fusion assignment (before the
loop that iterates over layer_indices) to first check if layer_indices is
truthy/has length > 0 and only then access layer_indices[0], otherwise set
residual_fusion to a safe default (e.g., False) or handle the empty case
appropriately so the subsequent loop over layer_indices is safe; update any
downstream assumptions in the loop that use residual_fusion or expect at least
one layer.
- Around line 444-460: The constructor sets up a local layer_indices used by the
nested forward() but never assigns it to the instance, causing
replace_routing_method_ctx to fail when accessing self.layer_indices; fix by
assigning the incoming layer_indices parameter to self.layer_indices in the same
initializer (e.g., in __init__ assign self.layer_indices = layer_indices) so
both the closure-based forward and the method replace_routing_method_ctx
reference the same instance attribute.
🧹 Nitpick comments (9)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

11-11: Prefer module import to preserve namespace.

This aligns with the Python import guideline and keeps the call site explicit.
As per coding guidelines, keep the module namespace on import.

♻️ Proposed refactor
-from tensorrt_llm.tools.layer_wise_benchmarks import get_calibrator
+import tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks
@@
-        token_selected_slots = get_calibrator().maybe_collect_or_replay_slots(
+        token_selected_slots = layer_wise_benchmarks.get_calibrator().maybe_collect_or_replay_slots(
             self.num_slots, token_selected_slots)

Also applies to: 443-444

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

11-11: Prefer module import to preserve namespace.

This follows the Python import guideline and keeps the call site explicit.
As per coding guidelines, keep the module namespace on import.

♻️ Proposed refactor
-from tensorrt_llm.tools.layer_wise_benchmarks import get_calibrator
+import tensorrt_llm.tools.layer_wise_benchmarks as layer_wise_benchmarks
@@
-        token_selected_slots = get_calibrator().maybe_collect_or_replay_slots(
+        token_selected_slots = layer_wise_benchmarks.get_calibrator().maybe_collect_or_replay_slots(
             self.num_slots, token_selected_slots)

Also applies to: 538-539

tensorrt_llm/llmapi/llm_args.py (1)

865-871: Ruff TRY003: avoid long inline exception message.

Ruff flags long messages inline in raise. Consider moving it to a constant/class var for readability.

♻️ Suggested tweak
 class LayerwiseBenchmarksConfig(StrictBaseModel):
     """
     Configuration for layer-wise benchmarks calibration.
     """
+    CALIBRATION_FILE_REQUIRED_MSG: ClassVar[str] = (
+        "calibration_file_path must be set when calibration_mode is COLLECT or REPLAY."
+    )

@@
     def validate_calibration_file_path(self) -> 'LayerwiseBenchmarksConfig':
         if self.calibration_mode in ["COLLECT", "REPLAY"
                                      ] and not self.calibration_file_path:
-            raise ValueError(
-                f"Expect calibration_file_path not to be empty when work on {self.calibration_mode} mode"
-            )
+            raise ValueError(self.CALIBRATION_FILE_REQUIRED_MSG)
         return self
examples/layer_wise_benchmarks/parse.py (2)

353-354: Enable Jinja2 autoescape for XSS mitigation.

While this generates local HTML files, enabling autoescape is a security best practice. The static analysis tool flagged this (S701).

Proposed fix
 loader = jinja2.FileSystemLoader(Path(__file__).parent)
-template = jinja2.Environment(loader=loader).get_template("breakdown_template.html")
+template = jinja2.Environment(loader=loader, autoescape=True).get_template("breakdown_template.html")

378-399: Consider adding strict=True to zip() for safety.

The static analysis tool flagged zip() without strict= (B905). While problem_set and kernels should have the same length by construction, adding strict=True provides a runtime check that catches mismatches early.

Proposed fix
 correlation = []
-for problem, runs in zip(problem_set, kernels):
+for problem, runs in zip(problem_set, kernels, strict=True):
     timeline = []
examples/layer_wise_benchmarks/sample_performance_alignment.sh (1)

121-130: Consider using parallel execution for independent parsing tasks.

The xargs -I% runs sequentially. For improved performance with multiple ranks, consider adding -P for parallel execution:

Proposed optimization
-seq 0 $((NP - 1)) | xargs -I% python3 parse_e2e.py \
+seq 0 $((NP - 1)) | xargs -P$NP -I% python3 parse_e2e.py \
     --eager-trace "$PROFILE_DIR/report_e2e_mark_rank%.nsys-rep" \
     --graph-trace "$PROFILE_DIR/report_e2e_collect_rank%.nsys-rep" \
     --layer-indices 5,6,7 \
     --warmup-times 5 \
     -o "$PROFILE_DIR/report_e2e_collect_rank%.json"
-seq 0 $((NP - 1)) | xargs -I% python3 parse.py \
+seq 0 $((NP - 1)) | xargs -P$NP -I% python3 parse.py \
     --profile-dir "$PROFILE_DIR" \
     --world-size $NP \
     --rank %
examples/layer_wise_benchmarks/correlation.py (2)

93-96: Enable Jinja2 autoescape for XSS mitigation.

The static analysis tool flagged this (S701). While generating local HTML files, enabling autoescape is a security best practice.

Proposed fix
 loader = jinja2.FileSystemLoader(Path(__file__).parent)
-template = jinja2.Environment(loader=loader).get_template("correlation_template.html")
+template = jinja2.Environment(loader=loader, autoescape=True).get_template("correlation_template.html")
 with open(args.output, "w") as f:
     f.write(template.render(rawData=data))

86-88: Consider adding strict=True to zip() for safety.

The static analysis tool flagged zip() without strict= (B905). Adding strict=True provides a runtime check that x_tgt and tgt_data["timeline"] have matching lengths.

Proposed fix
                     "duration": o["duration"] / 1000,
                     "end": o["end"] / 1000,
                 }
-                for x, o in zip(x_tgt, tgt_data["timeline"])
+                for x, o in zip(x_tgt, tgt_data["timeline"], strict=True)
             ],
examples/layer_wise_benchmarks/parse_e2e.py (1)

10-15: Keep parser_utils namespace in imports.

The guidelines require preserving the module namespace instead of importing symbols directly. Please switch to a module import and update call sites accordingly. As per coding guidelines, keep module namespaces in imports.

♻️ Suggested change
-from parser_utils import (
-    kernel_short_name,
-    lazy_convert_sqlite,
-    shortest_common_supersequence,
-    warned_names,
-)
+import parser_utils

Then update usages to parser_utils.kernel_short_name, parser_utils.lazy_convert_sqlite, parser_utils.shortest_common_supersequence, and parser_utils.warned_names.

yuantailing force-pushed the layer_wise_benchmarks branch from 3834e50 to bfceb88 on January 27, 2026 06:02
@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33687 [ run ] triggered by Bot. Commit: bfceb88

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33689 [ run ] triggered by Bot. Commit: 00a8743

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33727 [ run ] triggered by Bot. Commit: 00a8743

yuantailing requested a review from a team as a code owner on January 27, 2026 12:26
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
yuantailing force-pushed the layer_wise_benchmarks branch from de6bf6f to bd13002 on January 27, 2026 12:33
@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33731 [ run ] triggered by Bot. Commit: bd13002

@tensorrt-cicd (Collaborator)

PR_Github #33731 [ run ] completed with state SUCCESS. Commit: bd13002
/LLM/main/L0_MergeRequest_PR pipeline #26016 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuantailing (Member Author)

/bot run --disable-fail-fast

@Superjomn (Collaborator) left a comment

LGTM on the llmapi changes.

Superjomn requested a review from QiJune on January 28, 2026 02:33
@tensorrt-cicd (Collaborator)

PR_Github #33800 [ run ] triggered by Bot. Commit: bd13002

@tensorrt-cicd (Collaborator)

PR_Github #33800 [ run ] completed with state FAILURE. Commit: bd13002
/LLM/main/L0_MergeRequest_PR pipeline #26068 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33831 [ run ] triggered by Bot. Commit: bd13002

@yuantailing (Member Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #33845 [ run ] triggered by Bot. Commit: bd13002

@tensorrt-cicd (Collaborator)

PR_Github #33845 [ run ] completed with state SUCCESS. Commit: bd13002
/LLM/main/L0_MergeRequest_PR pipeline #26100 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@QiJune (Collaborator) left a comment

LGTM

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
@yuantailing (Member Author)

/bot skip --comment "only docs changes"

@tensorrt-cicd (Collaborator)

PR_Github #33980 [ skip ] triggered by Bot. Commit: bd3203f

@tensorrt-cicd (Collaborator)

PR_Github #33980 [ skip ] completed with state SUCCESS. Commit: bd3203f
Skipping testing for commit bd3203f

yuantailing merged commit 9152836 into NVIDIA:main on Jan 29, 2026
5 checks passed

5 participants