[None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError#11261

Merged
mikeiovine merged 2 commits into NVIDIA:main from hnover-nv:fix_doCheckError
Feb 9, 2026

Conversation

@hnover-nv (Collaborator) commented Feb 4, 2026

Description

The C++ macro sync_check_cuda_error(stream) is intended to conditionally synchronize the thread with the given stream and then check for any CUDA errors, based on user control and build type. Specifically, the user can set the environment variable CUDA_LAUNCH_BLOCKING to 0 or 1, and that setting should be respected. If the variable is unset, we should sync and check in debug builds but not in other builds. (Additionally, we never sync and check while in the middle of capturing a CUDA graph.)

To avoid looking up the environment variable on every call, the existing code used a thread-local variable firstCall to run the lookup only once. But a conditional was inverted, so the lookup never ran, and we always fell back to the build-type behavior. Even if it had run, although the code returned an optional, after the first call it would always have been set to true or false, losing the "environment variable not set" behavior.
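To make the bug concrete, here is an illustrative reconstruction (not the exact original source, whose details are not shown in this PR page): the guard is true on the first call, so the inverted test means the lookup body never executes.

```cpp
// Illustrative reconstruction of the bug shape; names and layout are
// assumptions, not the verbatim original code.
#include <cstdlib>
#include <optional>
#include <string>

std::optional<bool> isCudaLaunchBlockingBuggy()
{
    thread_local bool firstCall = true;
    thread_local std::optional<bool> result;
    if (!firstCall) // BUG: inverted; firstCall is true on the first call,
    {               // so this lookup body never runs.
        char const* env = std::getenv("CUDA_LAUNCH_BLOCKING");
        result = (env != nullptr && std::string(env) == "1");
        firstCall = false;
    }
    // Always nullopt: the env var is silently ignored and callers fall
    // back to the build-type behavior.
    return result;
}
```

Note that even with the conditional fixed, assigning result unconditionally inside the branch would collapse the "env var unset" case into false, which is the second problem described above.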

Here we fix all of that, simplifying the code to use a lambda-initialized static const variable instead of the thread-local and combining the two functions into one.
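A minimal standalone sketch of the fixed logic follows; isCapturing is stubbed out here and the stream parameter is dropped for brevity (in the real header the capture check queries the CUDA stream), so treat this as an illustration rather than the exact merged code.

```cpp
#include <cstdlib>
#include <string>

// Stub for illustration; the real code checks whether the given
// cudaStream_t is currently capturing a CUDA graph.
static bool isCapturing()
{
    return false;
}

bool doCheckError()
{
    // The lambda runs exactly once, on first use, and the result is cached
    // in a static const, so later calls pay no env-lookup cost.
    static bool const sDoCheckIfNotCapturing = []()
    {
        char const* env = std::getenv("CUDA_LAUNCH_BLOCKING");
        if (env != nullptr)
        {
            return std::string(env) == "1"; // explicit user setting wins
        }
#ifndef NDEBUG
        return true;  // unset: sync and check in debug builds
#else
        return false; // unset: skip in release builds
#endif
    }();
    // Never sync/check while capturing a CUDA graph.
    return sDoCheckIfNotCapturing && !isCapturing();
}
```

One consequence of the static (rather than thread-local) cache is that the environment variable is read once per process, which matches the intent since its value cannot meaningfully change mid-run.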

Test Coverage

There do not seem to be any existing tests that directly exercise this behavior. Two test decorators reference it:

  • The compute_sanitizer decorator in triton_kernels/testing.py should exercise it, but either we don't use that decorator, or nothing terrible happens when the env var is ignored (because synchronization isn't needed, or because those are debug builds where we did synchronize).
  • Tests using the assert_no_cuda_sync decorator in tests/unittest/utils/util.py will skip if the environment variable is set to true.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • [x] Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai bot changed the title to [refactor] Cache CUDA_LAUNCH_BLOCKING with lazy init Feb 4, 2026
@coderabbitai bot (Contributor) commented Feb 4, 2026

📝 Walkthrough

Refactored error checking logic in cudaUtils.h by removing the public isCudaLaunchBlocking() function and replacing it with a static, lazily-initialized doCheckIfNotCapturing flag that caches the CUDA_LAUNCH_BLOCKING environment variable value to eliminate per-call overhead.

Changes

Cohort / File(s): Error Checking Refactor (cpp/include/tensorrt_llm/common/cudaUtils.h)
Summary: Removed the public isCudaLaunchBlocking() function. Refactored doCheckError(cudaStream_t) to use a static, lazily-initialized doCheckIfNotCapturing flag computed once from the CUDA_LAUNCH_BLOCKING env var instead of checking on each call. Final error checking now combines this cached flag with the isCapturing(stream) result while preserving the debug/non-debug build-type behavior differences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title clearly and specifically describes the main change: fixing doCheckError to respect the CUDA_LAUNCH_BLOCKING environment variable.
  • Description check ✅ Passed: The PR description is well-structured and follows the template, explaining the bug, solution, test coverage, and checklist.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.

The copyright header shows 2022-2024, but this file is being meaningfully modified in 2026. As per coding guidelines, the copyright header should reflect the year of latest meaningful modification.

- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)

166-178: Static variable naming should use 's' prefix.

Per coding guidelines, locally visible static variables should use camelCase prefixed with 's'. The variable doCheckIfNotCapturing should be renamed to sDoCheckIfNotCapturing.

📝 Proposed fix
-    static bool const doCheckIfNotCapturing = []()
+    static bool const sDoCheckIfNotCapturing = []()
     {
         char const* env = std::getenv("CUDA_LAUNCH_BLOCKING");
         if (env != nullptr)
         {
             return std::string(env) == "1";
         }
 #ifndef NDEBUG
         return true;
 #else
         return false;
 #endif
     }();
-    return doCheckIfNotCapturing && !isCapturing(stream);
+    return sDoCheckIfNotCapturing && !isCapturing(stream);

@hnover-nv changed the title from [refactor] Cache CUDA_LAUNCH_BLOCKING with lazy init to [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError Feb 4, 2026
@hnover-nv (Collaborator, Author)

@coderabbitai apply the suggested copyright and variable naming fixes

@coderabbitai bot (Contributor) commented Feb 4, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Ensure PR #11261 is checked out =="
gh pr checkout 11261 -R NVIDIA/TensorRT-LLM >/dev/null

echo "== Basic repo context =="
git rev-parse --show-toplevel
git branch --show-current
git status -s

echo "== List changed files in PR =="
gh pr diff 11261 -R NVIDIA/TensorRT-LLM --name-only

echo "== Save full diff to pr_diff.patch (for reference) =="
gh pr diff 11261 -R NVIDIA/TensorRT-LLM > pr_diff.patch
wc -l pr_diff.patch || true

echo "== Fetch review threads and comments JSON =="
gh pr view 11261 -R NVIDIA/TensorRT-LLM --json reviewThreads,comments > pr_comments.json
jq '{
  reviewThreads: (.reviewThreads // []) | length,
  comments: (.comments // []) | length
}' pr_comments.json

echo "== Extract any suggestions or notes mentioning copyright or variable naming =="
jq -r '
  [
    .reviewThreads[]?.comments[]?.body? // empty,
    .comments[]?.body? // empty
  ]
  | map(select(test("(?i)copyright|variable|naming|style|identifier|Hungarian")))
  | .[]' pr_comments.json || true

echo "== Show the changed file header and key functions (cudaUtils.h) if present =="
if [ -f cpp/include/tensorrt_llm/common/cudaUtils.h ]; then
  echo "--- Header (first 40 lines) ---"
  sed -n '1,40p' cpp/include/tensorrt_llm/common/cudaUtils.h || true

  echo "--- doCheckError occurrences ---"
  rg -n -C3 --type=cpp '\bdoCheckError\s*\(' cpp/include/tensorrt_llm/common/cudaUtils.h || true

  echo "--- CUDA_LAUNCH_BLOCKING occurrences ---"
  rg -n -C3 --type=cpp 'CUDA_LAUNCH_BLOCKING' cpp/include/tensorrt_llm/common/cudaUtils.h || true

  echo "--- static const flag names around doCheckError ---"
  rg -n -C3 --type=cpp 'static\s+const|const\s+static' cpp/include/tensorrt_llm/common/cudaUtils.h || true
else
  echo "File cpp/include/tensorrt_llm/common/cudaUtils.h not found in working tree."
fi

Length of output: 1110


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Fetch reviews and comments with available fields =="
gh pr view 11261 -R NVIDIA/TensorRT-LLM --json reviews,comments,latestReviews > pr_reviews.json

echo "== Extract review comments mentioning copyright, variable, naming, style =="
jq -r '
  [
    (.reviews[]?.body? // empty),
    (.latestReviews[]?.body? // empty),
    (.comments[]?.body? // empty)
  ]
  | map(select(. != null and . != ""))
  | map(select(test("(?i)copyright|variable|naming|style|identifier|Hungarian|doCheck|static|const")))
  | .[]' pr_reviews.json || echo "No matching comments found in JSON"

echo ""
echo "== Read the PR diff to identify changes =="
cat pr_diff.patch

echo ""
echo "== Full content of cudaUtils.h on PR branch =="
cat cpp/include/tensorrt_llm/common/cudaUtils.h

Length of output: 62744


@hnover-nv thanks — applying both suggestions below.

  • Update SPDX header year to 2026.
  • Rename the static local flag in doCheckError() to follow the s-prefixed static naming convention.

✏️ Learnings added
Learnt from: hnover-nv
URL: https://github.com/NVIDIA/TensorRT-LLM/pull/11261

Timestamp: 2026-02-04T16:48:11.405Z
Learning: TensorRT-LLM C++ style: Locally scoped static variables (e.g., inside functions) use camelCase with an 's' prefix, e.g., sDoCheckIfNotCapturing in cpp/include/tensorrt_llm/common/cudaUtils.h.

Learnt from: hnover-nv
URL: https://github.com/NVIDIA/TensorRT-LLM/pull/11261

Timestamp: 2026-02-04T16:48:11.405Z
Learning: SPDX header policy: The SPDX-FileCopyrightText year range in C++ headers should include the latest meaningful modification year (e.g., 2022-2026 for edits made in 2026).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-06T03:07:15.754Z
Learning: Applies to **/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh} : Do not use Hungarian notation in C++ except for 'apps hungarian' (e.g., 'nb' to indicate count: `mNbLayers`)


@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #34837 [ run ] triggered by Bot. Commit: 06cefcc

@tensorrt-cicd (Collaborator)

PR_Github #34837 [ run ] completed with state SUCCESS. Commit: 06cefcc
/LLM/main/L0_MergeRequest_PR pipeline #26875 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #34910 [ run ] triggered by Bot. Commit: 06cefcc

@tensorrt-cicd (Collaborator)

PR_Github #34910 [ run ] completed with state FAILURE. Commit: 06cefcc
/LLM/main/L0_MergeRequest_PR pipeline #26929 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #34919 [ run ] triggered by Bot. Commit: 07debb0

@tensorrt-cicd (Collaborator)

PR_Github #34919 [ run ] completed with state SUCCESS. Commit: 07debb0
/LLM/main/L0_MergeRequest_PR pipeline #26936 completed with status: 'SUCCESS'

@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #35009 [ run ] triggered by Bot. Commit: 92c796a

@mikeiovine (Collaborator) left a comment

Thanks for cleaning it up

@tensorrt-cicd (Collaborator)

PR_Github #35009 [ run ] completed with state FAILURE. Commit: 92c796a
/LLM/main/L0_MergeRequest_PR pipeline #27012 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@hnover-nv (Collaborator, Author)

/bot run

1 similar comment
@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #35031 [ run ] triggered by Bot. Commit: 8ffa120

@tensorrt-cicd (Collaborator)

PR_Github #35031 [ run ] completed with state ABORTED. Commit: 8ffa120
LLM/main/L0_MergeRequest_PR #27034 (Blue Ocean) completed with status: ABORTED

@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #35142 [ run ] triggered by Bot. Commit: 4b18a76

@tensorrt-cicd (Collaborator)

PR_Github #35142 [ run ] completed with state SUCCESS. Commit: 4b18a76
/LLM/main/L0_MergeRequest_PR pipeline #27133 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
@hnover-nv (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #35200 [ run ] triggered by Bot. Commit: 6daad3f

@tensorrt-cicd (Collaborator)

PR_Github #35200 [ run ] completed with state SUCCESS. Commit: 6daad3f
/LLM/main/L0_MergeRequest_PR pipeline #27188 completed with status: 'SUCCESS'

@hnover-nv (Collaborator, Author)

/bot skip --comment "passed last night, accidentally updated branch instead of merging, but don't need to retest"

@tensorrt-cicd (Collaborator)

PR_Github #35243 Bot args parsing error: Failed to parse bot args

@hnover-nv (Collaborator, Author)

/bot skip --comment "updated to main by accident, already passed"

@tensorrt-cicd (Collaborator)

PR_Github #35244 [ skip ] triggered by Bot. Commit: dac963b

@tensorrt-cicd (Collaborator)

PR_Github #35244 [ skip ] completed with state SUCCESS. Commit: dac963b
Skipping testing for commit dac963b

@mikeiovine mikeiovine merged commit 100bfdc into NVIDIA:main Feb 9, 2026
5 checks passed
@hnover-nv hnover-nv deleted the fix_doCheckError branch February 9, 2026 19:34
inciaf pushed a commit to inciaf/trtllm-energy-monitoring that referenced this pull request Feb 18, 2026
…IA#11261)

Signed-off-by: Harris Nover <249353502+hnover-nv@users.noreply.github.com>
Signed-off-by: Ahmet Inci <ainci@nvidia.com>
3 participants