[#11318][infra] AutoDeploy: Add fused rope kernel - triton_rope_on_interleaved_qk_inputs #11327

Merged
bmarimuthu-nv merged 5 commits into NVIDIA:main from nv-auto-deploy:bala/triton_rope_on_interleaved_qk_inputs
Feb 17, 2026
Conversation

@bmarimuthu-nv
Collaborator

@bmarimuthu-nv bmarimuthu-nv commented Feb 6, 2026

Summary by CodeRabbit

  • New Features

    • Added a fused RoPE operator and runtime kernel for interleaved query/key layouts, plus an optimizer path that replaces the original RoPE with this high-performance variant.
  • Tests

    • Added unit tests validating correctness across batch/sequence sizes, head configurations (including MQA), and dtypes (FP16/BF16).
  • Documentation

    • Documented the new fused interleaved Q/K RoPE operator in the custom-ops README.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Tava architecture diagram is updated if the PR includes a significant design change.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@bmarimuthu-nv
Collaborator Author

@CodeRabbit summary

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

✅ Actions performed

Summary regeneration triggered.

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

Adds a Triton JIT kernel and public op to apply RoPE on interleaved Q/K (position lookup + de-interleave + rotation), integrates an optimizer path to replace FX nodes with the new op, and includes kernel, wrapper, tests, and README entry.

Changes

  • Triton Kernel Implementation (tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/rope.py):
    Added the rope_fwd_interleaved_kernel Triton JIT kernel for interleaved Q/K: loads per-position cos/sin (two halves), de-interleaves Q/K, applies RoPE (y_first = a*cos - b*sin; y_second = b*cos + a*sin), writes the Q/K outputs in half-split layout, and handles masking and grid/stride math.
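The rotation described above can be sketched as a PyTorch reference, which is also the kind of baseline the unit tests compare against. This is a hedged sketch, not the actual kernel: the [B, S, H, D] layout, even/odd interleaving of rotation pairs, and [B, S, D//2] cos/sin shapes are assumptions inferred from the summary.

```python
import torch


def rope_interleaved_reference(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Reference RoPE for interleaved inputs.

    q:        [B, S, H, D] with rotation pairs interleaved along D (a = even dims, b = odd dims)
    cos, sin: [B, S, D//2] per-position values
    Returns a [B, S, H, D] tensor in half-split layout: first D//2 dims hold
    y_first = a*cos - b*sin, last D//2 dims hold y_second = b*cos + a*sin.
    """
    a = q[..., 0::2].float()        # de-interleave: even positions -> a
    b = q[..., 1::2].float()        # odd positions -> b
    c = cos[:, :, None, :].float()  # broadcast over the head dimension
    s = sin[:, :, None, :].float()
    y_first = a * c - b * s
    y_second = b * c + a * s
    return torch.cat([y_first, y_second], dim=-1).to(q.dtype)
```

With a zero rotation angle (cos = 1, sin = 0) the result is simply the de-interleaved input, and for any angle the per-pair norm is preserved, which makes this easy to sanity-check.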
  • Public API Wrapper & Registration (tensorrt_llm/_torch/auto_deploy/custom_ops/triton_rope.py):
    Added apply_rope_on_interleaved_qk_inputs and a fake variant; validates inputs, allocates outputs, computes grid/strides/metadata, launches rope_fwd_interleaved_kernel, and registers the op as auto_deploy::triton_rope_on_interleaved_qk_inputs.
  • Transformation / Optimizer Integration (tensorrt_llm/_torch/auto_deploy/transform/library/rope.py):
    Added _trace_back_index, _validate_interleaved_rope_inputs, and _optimize_interleaved; integrated interleaved-RoPE optimization into OptimizeRope so that torch_rope_with_qk_interleaving is replaced with the Triton op when its inputs validate.
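The replacement step above follows the standard torch.fx pattern of retargeting matching call_function nodes in the traced graph. A toy sketch of that pattern; the replace_op helper and the sin/cos ops are illustrative stand-ins, not the actual transform:

```python
import torch
from torch.fx import symbolic_trace


def replace_op(gm: torch.fx.GraphModule, src, dst):
    """Retarget every call_function node matching `src` to `dst`, then recompile."""
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is src:
            node.target = dst
    gm.graph.lint()
    gm.recompile()
    return gm


class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x) + 1.0


gm = symbolic_trace(Toy())
replace_op(gm, torch.sin, torch.cos)  # stand-in for swapping in the fused Triton op
x = torch.randn(4)
assert torch.allclose(gm(x), torch.cos(x) + 1.0)
```

The real transform additionally validates the matched node's inputs (the _validate_interleaved_rope_inputs step) before rewriting, so only graphs that the fused kernel can actually handle are changed.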
  • Unit Tests — Kernel & Dtypes (tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_rope.py):
    Added a _precompute_cos_sin_cache helper and two tests: test_triton_rope_on_interleaved_qk_inputs (parametrized over batch/seq/heads/head-dim) and test_triton_rope_interleaved_dtype_consistency (FP16, BF16), comparing Triton outputs against a PyTorch reference.
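The _precompute_cos_sin_cache helper itself is not shown in this summary; a plausible sketch of such a cache, following the common rotary-embedding recipe (the base value, function name, and shapes here are assumptions, not the test's actual code):

```python
import torch


def precompute_cos_sin_cache(max_seq_len: int, head_dim: int, base: float = 10000.0):
    """Build a RoPE cos/sin cache: one row per position, head_dim//2 frequencies."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)  # [max_seq_len, head_dim // 2]
    return angles.cos(), angles.sin()
```

A kernel wrapper would then gather rows of this cache by position_ids before (or inside) the launch, which is what the "position lookup" in the walkthrough refers to.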
  • Unit Tests — Transformation (tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_rope_transformation.py):
    Added test_optimize_interleaved_rope (parametrized for MHA/MQA), which asserts the FX graph contains the original op before optimization, verifies it is replaced by triton_rope_on_interleaved_qk_inputs, and then runs correctness checks on the transformed GraphModule.
  • Docs / Registry (tensorrt_llm/_torch/auto_deploy/custom_ops/README.md):
    Added a README entry for torch.ops.auto_deploy.triton_rope_on_interleaved_qk_inputs describing the fused interleaved Q/K RoPE operator.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Wrapper as apply_rope_on_interleaved_qk_inputs
    participant Triton as rope_fwd_interleaved_kernel
    participant Cache as cos/sin cache
    Caller->>Wrapper: call(q, k, cos_cache, sin_cache, position_ids)
    Wrapper->>Triton: launch kernel (grid, strides, metadata)
    Triton->>Cache: load cos/sin for positions (two halves)
    Triton->>Triton: de-interleave Q/K; compute y_first/y_second (RoPE)
    Triton->>Wrapper: write q_out, k_out
    Wrapper->>Caller: return q_out, k_out
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly describes the main change: adding a fused RoPE kernel for interleaved Q/K inputs to the AutoDeploy infrastructure.


@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/triton_rope_on_interleaved_qk_inputs branch from 6dc88b4 to 1d2475f on February 6, 2026 00:32
@bmarimuthu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #35032 [ run ] triggered by Bot. Commit: 1d2475f

@bmarimuthu-nv bmarimuthu-nv marked this pull request as ready for review February 6, 2026 00:51
@bmarimuthu-nv bmarimuthu-nv requested a review from a team as a code owner February 6, 2026 00:51
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/rope.py`:
- Around lines 131-146: The kernel assumes a contiguous [B, S, H, D] layout by computing q_out_stride_b, q_out_stride_s, and q_out_stride_h itself and then deriving q_out_offsets_first/second for tl.store into q_out_ptr; this breaks if the output tensor is non-contiguous. Fix by adding explicit output-stride kernel parameters (e.g., q_out_stride_b, q_out_stride_s, q_out_stride_h, or a q_out_strides tuple) and using the passed-in strides when computing q_out_offsets_first and q_out_offsets_second, then store with tl.store(q_out_ptr + q_out_offsets_..., ...) as before. Alternatively, ensure the caller allocates a contiguous output and document/validate that at the kernel entry (e.g., assert contiguity) if you prefer the caller-side fix.
- Around line 5-188: Add the missing NVIDIA copyright header to the top of this
file: insert the standard NVIDIA header block (with the correct year of latest
meaningful modification) as the first lines of
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/rope.py so the file
that defines rope_fwd_interleaved_kernel contains the required header; ensure
the header format matches other TensorRT-LLM source files and precedes the
definition of rope_fwd_interleaved_kernel and any imports or Triton kernel code.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/triton_rope.py`:
- Around line 143-228: Add the required NVIDIA copyright header to the top of
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_rope.py (above the
apply_rope_on_interleaved_qk_inputs definition and any imports) so the file
includes the standard NVIDIA header block used across the repo (with correct
year and organization attribution and any SPDX/license identifier), preserving
existing file contents and formatting; ensure the header appears before symbols
like apply_rope_on_interleaved_qk_inputs and any triton/torch imports.
- Around lines 173-196: Add strict input validation before the grid computation in triton_rope.py: assert position_ids.shape == (B, S) so position_ids matches the q/k batch and sequence dims; assert k.shape[:2] == q.shape[:2] so Q and K share B and S; and either assert H_Q >= H_K with a clear error (e.g., "H_Q must be >= H_K") or change the grid calculation to use max(H_Q, H_K) (replace triton.cdiv(H_Q, BLOCK_SIZE_H) with triton.cdiv(max(H_Q, H_K), BLOCK_SIZE_H)) so K heads beyond H_Q are processed. Keep the existing cos/sin cache shape assertion and raise informative messages on failure.
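The checks requested in this comment amount to a few shape assertions ahead of the kernel launch. A sketch of what they could look like; the function and argument names are illustrative, not the actual wrapper's API:

```python
def validate_interleaved_rope_inputs(q, k, position_ids, head_dim):
    """Shape checks suggested by the review. Works on anything exposing .shape."""
    B, S, H_Q, D = q.shape
    if tuple(k.shape[:2]) != (B, S):
        raise ValueError(f"Q and K must share batch/seq dims: {tuple(q.shape[:2])} vs {tuple(k.shape[:2])}")
    H_K = k.shape[2]
    if H_Q < H_K:
        raise ValueError(f"H_Q ({H_Q}) must be >= H_K ({H_K}) for the current grid computation")
    if tuple(position_ids.shape) != (B, S):
        raise ValueError(f"position_ids must have shape ({B}, {S}), got {tuple(position_ids.shape)}")
    if D != head_dim or D % 2 != 0:
        raise ValueError(f"head_dim must be even and equal {head_dim}, got {D}")
```

The alternative the review mentions, sizing the grid with max(H_Q, H_K), would make the H_Q >= H_K check unnecessary, at the cost of some wasted threads when head counts differ.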
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/rope.py (1)

116-129: RoPE computation not promoted to fp32, unlike sibling kernels.

The existing rope_fwd_kernel (line 250) and rope_fwd_flattened_kernel (line 337) explicitly cast loaded values to tl.float32 before performing RoPE math. This new kernel operates in the input dtype (e.g., bf16/fp16), which may reduce numerical accuracy for the multiply-subtract/add operations.

Consider promoting q_a, q_b, cos, and sin to tl.float32 before computing q_y1/q_y2 (and similarly for K), then casting back when storing.
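The suggested fix is the usual mixed-precision pattern: promote operands to fp32, do the multiply-add there, and cast back when storing. A PyTorch sketch of that pattern (not the Triton kernel itself):

```python
import torch


def rope_rotate_fp32(a, b, cos, sin):
    """Rotate (a, b) by (cos, sin) with fp32 intermediates, casting back to the input dtype."""
    a32, b32 = a.float(), b.float()
    c32, s32 = cos.float(), sin.float()
    y_first = a32 * c32 - b32 * s32
    y_second = b32 * c32 + a32 * s32
    return y_first.to(a.dtype), y_second.to(a.dtype)
```

With fp16/bf16 inputs this still rounds once at the end, but it avoids the intermediate rounding of each product and of the subtraction/addition, which is where the accuracy loss in the current kernel would come from.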

@bmarimuthu-nv
Collaborator Author

@CodeRabbit summary

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

✅ Actions performed

Summary regeneration triggered.

@bmarimuthu-nv
Collaborator Author

/bot run


@tensorrt-cicd
Collaborator

PR_Github #35052 [ run ] triggered by Bot. Commit: 0b3f736

@tensorrt-cicd
Collaborator

PR_Github #35052 [ run ] completed with state ABORTED. Commit: 0b3f736
LLM/main/L0_MergeRequest_PR #27052 (Blue Ocean) completed with status: ABORTED

@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/triton_rope_on_interleaved_qk_inputs branch from 0b3f736 to 3799161 on February 6, 2026 18:14
@bmarimuthu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #35135 [ run ] triggered by Bot. Commit: 3799161

@tensorrt-cicd
Collaborator

PR_Github #35135 [ run ] completed with state FAILURE. Commit: 3799161
/LLM/main/L0_MergeRequest_PR pipeline #27128 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@bmarimuthu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #35158 [ run ] triggered by Bot. Commit: 426157c

@tensorrt-cicd
Collaborator

PR_Github #35158 [ run ] completed with state SUCCESS. Commit: 426157c
/LLM/main/L0_MergeRequest_PR pipeline #27149 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/triton_rope_on_interleaved_qk_inputs branch from 426157c to dee5f09 on February 9, 2026 18:34
@bmarimuthu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #35365 [ run ] triggered by Bot. Commit: dee5f09

@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/triton_rope_on_interleaved_qk_inputs branch from dee5f09 to c08e4b7 on February 10, 2026 19:56
@bmarimuthu-nv
Collaborator Author

/bot run

@bmarimuthu-nv bmarimuthu-nv enabled auto-merge (squash) February 10, 2026 20:02
@tensorrt-cicd
Collaborator

PR_Github #35543 [ run ] triggered by Bot. Commit: c08e4b7

@tensorrt-cicd
Collaborator

PR_Github #35543 [ run ] completed with state SUCCESS. Commit: c08e4b7
/LLM/main/L0_MergeRequest_PR pipeline #27448 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@taylor-yb-lee
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #35966 [ run ] triggered by Bot. Commit: c08e4b7

@tensorrt-cicd
Collaborator

PR_Github #35966 [ run ] completed with state SUCCESS. Commit: c08e4b7
/LLM/main/L0_MergeRequest_PR pipeline #27779 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

bmarimuthu-nv and others added 5 commits February 16, 2026 15:42
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta suyoggupta force-pushed the bala/triton_rope_on_interleaved_qk_inputs branch from c08e4b7 to 83751b6 on February 16, 2026 23:50
@suyoggupta
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #36062 [ run ] triggered by Bot. Commit: 83751b6

@tensorrt-cicd
Collaborator

PR_Github #36062 [ run ] completed with state SUCCESS. Commit: 83751b6
/LLM/main/L0_MergeRequest_PR pipeline #27868 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@suyoggupta
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"


@tensorrt-cicd
Collaborator

PR_Github #36066 [ run ] triggered by Bot. Commit: 83751b6

@tensorrt-cicd
Collaborator

PR_Github #36066 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 AM PST on 2/17.

@suyoggupta
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #36067 [ run ] triggered by Bot. Commit: 83751b6

@tensorrt-cicd
Collaborator

PR_Github #36067 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 12 AM PST on 2/17.

@bmarimuthu-nv
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #36080 [ run ] triggered by Bot. Commit: 83751b6

@tensorrt-cicd
Collaborator

PR_Github #36080 [ run ] completed with state SUCCESS. Commit: 83751b6
/LLM/main/L0_MergeRequest_PR pipeline #27878 completed with status: 'SUCCESS'

@bmarimuthu-nv bmarimuthu-nv merged commit 6157f30 into NVIDIA:main Feb 17, 2026
5 checks passed