[None][feat] Hang detection for executor loop and worker.#10480

Merged
yuxianq merged 7 commits into NVIDIA:main from yuxianq:hang-detector
Jan 13, 2026

Conversation

@yuxianq (Collaborator) commented Jan 7, 2026

Description

This PR adds hang detection functionality. When a hang is detected, all thread stacks are printed to show which function call is stuck:

  1. The default/overlap/pp executor loop prints stacks when it cannot finish one iteration within the timeout (300s by default).
  2. When TRTLLM_WORKER_PRINT_STACKS_PERIOD is set, worker processes print thread stacks periodically; this is intended for hang detection during e2e tests.
  3. When TRTLLM_PRINT_STACKS_PERIOD is set, the main process prints thread stacks periodically; this is intended for hang detection during unit tests.
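
The periodic stack-printing pattern behind items 2 and 3 can be sketched as follows. This is a minimal sketch only; `print_all_stacks` matches the helper named later in this PR, but `start_stack_printer` and its exact behavior are illustrative assumptions, not the actual TensorRT-LLM code:

```python
import os
import sys
import threading
import time
import traceback


def print_all_stacks():
    # Dump the current stack of every live thread in this process.
    for thread_id, frame in sys._current_frames().items():
        print(f"Thread {thread_id}:")
        print("".join(traceback.format_stack(frame)))


def start_stack_printer(env_var):
    # Period in seconds; the feature is disabled when unset or non-positive.
    period = int(os.getenv(env_var, "-1"))
    if period <= 0:
        return None

    def _print_stacks_loop():
        while True:
            time.sleep(period)
            print_all_stacks()

    # Daemon thread: periodic printing never blocks process shutdown.
    thread = threading.Thread(target=_print_stacks_loop, daemon=True)
    thread.start()
    return thread
```

Because the thread is daemonized, enabling the diagnostic cannot keep a finished test process alive.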

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq requested review from QiJune and Superjomn January 7, 2026 05:24
@yuxianq yuxianq requested review from a team as code owners January 7, 2026 05:24
@yuxianq (Collaborator, Author) commented Jan 7, 2026

/bot run --disable-fail-fast

@coderabbitai bot (Contributor) commented Jan 7, 2026

📝 Walkthrough

Introduces a hang detection system for TensorRT-LLM executors using an asyncio-based HangDetector that monitors for execution hangs in a background thread. The detector is integrated into PyExecutor and ExecutorRequestQueue, with checkpointing at key execution points. Includes utility functions for periodic stack printing across initialization and worker processes.

Changes

  • Hang Detection Core — tensorrt_llm/_torch/pyexecutor/hang_detector.py: New HangDetector class with an asyncio event loop in a background thread; monitors for hangs using async tasks with a configurable timeout; provides checkpoint/pause/stop lifecycle methods; invokes the on_detected callback and prints stack traces when a hang is detected.

  • Hang Detection Integration — tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/executor_request_queue.py: Adds a hang_detection_timeout parameter to the PyExecutor constructor; instantiates HangDetector with an error-signaling callback; passes the detector to ExecutorRequestQueue; inserts checkpoint and stop calls around fetch, broadcast, and synchronization points; checks detected status during shutdown.

  • Stack Tracing Utilities — tensorrt_llm/_utils.py, tensorrt_llm/_common.py, tensorrt_llm/executor/worker.py: Adds a print_all_stacks() function using sys._current_frames(); creates background daemon threads for periodic stack printing controlled by the TRTLLM_PRINT_STACKS_PERIOD and TRTLLM_WORKER_PRINT_STACKS_PERIOD environment variables; logging integrated at initialization and worker startup.
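
The HangDetector design summarized above (a private asyncio event loop in a daemon thread, with a watchdog task that each checkpoint() resets) can be sketched roughly as below. Names follow the walkthrough, but the body is an illustrative reconstruction under those assumptions, not the actual implementation:

```python
import asyncio
import threading


class HangDetector:
    """Flags a hang when checkpoint() is not called again within `timeout` seconds."""

    def __init__(self, timeout=None, on_detected=None):
        # Only None selects the default, so 0/negative can be passed intentionally.
        self.timeout = 300 if timeout is None else timeout
        self.on_detected = on_detected
        self.loop = None
        self.loop_thread = None
        self.task = None
        self._detected = False

    def start(self):
        # Run a private asyncio loop in a daemon thread so the watchdog
        # keeps ticking even while the executor thread is stuck.
        self.loop = asyncio.new_event_loop()
        self.loop_thread = threading.Thread(target=self.loop.run_forever,
                                            daemon=True)
        self.loop_thread.start()
        self.checkpoint()

    async def _detect_hang(self):
        try:
            await asyncio.sleep(self.timeout)
        except asyncio.CancelledError:
            return  # Reset by a newer checkpoint() or by pause().
        self._detected = True
        # The real detector would also print all thread stacks here.
        if self.on_detected is not None:
            self.on_detected()

    def checkpoint(self):
        # Progress was made: cancel the old watchdog and start a fresh window.
        self.pause()
        self.task = asyncio.run_coroutine_threadsafe(self._detect_hang(),
                                                     self.loop)

    def pause(self):
        if self.task is not None:
            self.task.cancel()
            self.task = None

    def detected(self):
        return self._detected

    def stop(self):
        self.pause()
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.loop_thread.join()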

Sequence Diagram(s)

sequenceDiagram
    participant PyExec as PyExecutor
    participant ReqQueue as ExecutorRequestQueue
    participant Detector as HangDetector
    participant AsyncLoop as Async Event Loop<br/>(Background Thread)
    participant Callback as on_detected<br/>Callback

    Note over PyExec,Callback: Hang Detection Flow

    PyExec->>Detector: start() on warmup
    Detector->>AsyncLoop: Create event loop in daemon thread

    loop Each Execution Iteration
        PyExec->>ReqQueue: fetch_requests()
        ReqQueue->>Detector: checkpoint()
        Detector->>AsyncLoop: schedule _detect_hang() task
        AsyncLoop->>AsyncLoop: async sleep(timeout)

        alt Execution Completes in Time
            ReqQueue-->>PyExec: requests ready
            PyExec->>Detector: checkpoint() at next point
            Detector->>AsyncLoop: cancel prior task, schedule new
        else Timeout Expires
            AsyncLoop->>AsyncLoop: mark hang detected
            AsyncLoop->>AsyncLoop: log error, print stacks
            AsyncLoop->>Callback: invoke callback
            Callback->>PyExec: signal error, set shutdown
            PyExec->>PyExec: early exit, skip worker wait
        end
    end

    PyExec->>Detector: stop() on shutdown
    Detector->>AsyncLoop: cancel tasks, stop loop, join thread

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 warnings

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 34.62%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check ⚠️ Warning: PR description is incomplete. The Description section lacks implementation detail and the Test Coverage section is entirely empty. Add implementation details explaining how hang detection works, and provide specific test cases in the Test Coverage section to validate the hang detection functionality.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly describes the main feature being added (hang detection for executor loop and worker) and follows the template format correctly.


@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_common.py (1)

1-14: Update copyright header year to reflect latest modification

This file now has new runtime behavior but the SPDX copyright line still ends at 2024. Per the project guidelines (“year of latest meaningful modification”), please bump the final year (e.g., to 2025/2026) so the header matches current changes.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

373-384: Checkpoint/stop usage is good; ensure HangDetector is stopped on all exit paths

The new checkpoint() calls at the top of each executor loop and the stop() calls when should_stop_processing or scheduled_batch is None are well placed and give you a clear notion of “no progress for N seconds” during normal operation.

Two lifecycle gaps to address:

  1. Stop detector on exceptional exits

    _event_loop_wrapper wraps self.event_loop() and always invokes _executor_loop_cleanup() in finally, but it never stops the HangDetector. If self.event_loop() exits early due to an exception before it reaches the hang_detector.stop() sites in the loops, the detector’s loop thread will keep running indefinitely for the lifetime of the process.

    You can make cleanup robust by always stopping the detector in the wrapper:

    Proposed fix in _event_loop_wrapper:

        def _event_loop_wrapper(self):
            try:
                with customized_gc_thresholds(
                        self.garbage_collection_gen0_threshold):
                    self.event_loop()
            except Exception as e:
                logger.error(f"Error in event loop: {e}")
                logger.error(traceback.format_exc())
                raise e
            finally:
    -           self._executor_loop_cleanup()
    +           # Ensure hang detector loop is torn down even on exceptions.
    +           self.hang_detector.stop()
    +           self._executor_loop_cleanup()

    `stop()` is idempotent relative to the explicit calls in the happy-path breaks.
    
    
  2. Shutdown path after a detected hang

    In shutdown(), you enqueue_shutdown_request() and wait() on shutdown_event, then:

    if self.hang_detector.detected():
        return
    self.worker_thread.join()

    Skipping join() when a hang is detected makes sense to avoid blocking forever, but it also means hang_detector.stop() is never invoked in that scenario unless it has already been called from inside the loop. With the change above in _event_loop_wrapper, you’ll ensure the detector is torn down even when the event loop exits via an internal error, and the early-return here becomes purely about not joining a wedged worker.

Overall, the hang-detection checkpoints/stop calls in the loops look good; tightening the cleanup as above will avoid leaving detector threads alive after error or hang conditions.

Also applies to: 397-405, 483-491, 969-988, 1345-1363, 1548-1568

🤖 Fix all issues with AI agents
In @tensorrt_llm/_torch/pyexecutor/hang_detector.py:
- Around lines 1-21: Add the standard NVIDIA SPDX/copyright header at the top of this file (matching the header used in other modules such as _utils.py, with the year range updated to include this change). Also change the timeout assignment in HangDetector.__init__ so that only timeout is None triggers the default: replace self.timeout = timeout or 300 with an explicit None check (e.g., self.timeout = 300 if timeout is None else timeout) so callers can pass 0 or negative values intentionally and PyExecutor's hang_detection_timeout=None remains unambiguous. Update any type hints/docstrings as needed to reflect these semantics.
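
The `or` pitfall called out above is worth spelling out: `timeout or 300` coerces every falsy value, including an intentional 0, to the default. A minimal sketch of the explicit check (the function name is illustrative):

```python
def resolve_timeout(timeout):
    # `timeout or 300` would silently turn an intentional 0 (or any falsy
    # value) into 300; an explicit None check keeps the semantics unambiguous.
    return 300 if timeout is None else timeout
```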

In @tensorrt_llm/_torch/pyexecutor/py_executor.py:
- Around lines 49-50: The HangDetector integration currently starts monitoring even when hang_detection_timeout is None and runs heavy cleanup from the detector thread. Change PyExecutor so that HangDetector is only constructed/started when hang_detection_timeout is not None (making None mean "disabled"). Modify the on_detected callback in PyExecutor to perform only minimal, thread-safe signaling: set self.is_shutdown = True, call self.shutdown_event.set(), and log an error. Remove direct calls to self._handle_errors(...) and heavy resource mutations from the on_detected callback, so the main thread or executor shutdown path observes is_shutdown/shutdown_event and performs _handle_errors and full cleanup under the normal threading/event-loop assumptions (apply to on_detected usages and construction sites involving HangDetector, hang_detection_timeout, _handle_errors, shutdown_event, and is_shutdown).
🧹 Nitpick comments (3)
tensorrt_llm/executor/worker.py (1)

3-4: Worker stack-printing thread is fine; minor robustness/DRY opportunities

The periodic stack-printing daemon in worker_main is functionally sound and mirrors the pattern used in _common._init, and being a daemon thread avoids shutdown blocking.

Two small suggestions:

  • print_stacks_period = int(os.getenv("TRTLLM_WORKER_PRINT_STACKS_PERIOD", "-1")) will raise ValueError if the env var is set to a non-integer; if misconfiguration is expected in the wild, consider wrapping this in a try/except with a safe fallback.
  • The _print_stacks loop is now duplicated between _common._init and worker_main; if this pattern grows, consider a shared helper (e.g., start_stack_printer_thread(env_var_name: str, label: str)) in _utils to avoid divergence.

Also applies to: 14-15, 158-172

tensorrt_llm/_common.py (1)

19-20: Library-wide stack-printing daemon is fine; same minor nits as worker

The _print_stacks helper and associated daemon thread in _init correctly gate on TRTLLM_PRINT_STACKS_PERIOD and reuse print_all_stacks(). Behavior is appropriate for low-frequency diagnostics and won’t block shutdown because the thread is daemonized.

Minor suggestions (same as in worker.py):

  • print_stacks_period = int(os.getenv("TRTLLM_PRINT_STACKS_PERIOD", "-1")) will raise if the env var is non-numeric; consider a small try/except ValueError with a warning and disabling the feature instead of aborting init.
  • _print_stacks is now duplicated between _common._init and worker_main; consider a shared helper to centralize this behavior if you plan to evolve it further.

Also applies to: 38-39, 86-97
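
Both nitpicks above suggest tolerating a malformed period value rather than aborting init; a sketch of such a guard (helper name and warning text are illustrative, not the actual code):

```python
import os


def get_print_stacks_period(env_var):
    # Tolerate a malformed value: warn and disable the feature instead of
    # letting int() raise ValueError during initialization.
    raw = os.getenv(env_var)
    if raw is None:
        return -1
    try:
        return int(raw)
    except ValueError:
        print(f"Warning: ignoring non-integer {env_var}={raw!r}")
        return -1
```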

tensorrt_llm/_torch/pyexecutor/hang_detector.py (1)

22-77: HangDetector core logic is sound; consider minor robustness tweaks

The overall design (own asyncio loop in a dedicated daemon thread, checkpoint() reset, pause() for cancel, stop() to cancel all tasks and stop the loop) is reasonable and low overhead.

A couple of small improvements you might consider:

  • In _detect_hang, cancellations from pause() will surface as CancelledError from asyncio.sleep. Explicitly catching asyncio.CancelledError and returning early can avoid any chance of noisy task-exception logs on some Python versions.
  • If there’s a risk of start() being called twice, you may want to guard against reinitializing loop/loop_thread while an old loop is still alive (e.g., by no-op’ing or raising if self.loop is not None).

These are non-blocking, but they’ll make the detector a bit more bulletproof across executor lifecycle edge cases.
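
The first suggestion — explicitly catching the cancellation raised inside asyncio.sleep — behaves like this minimal, standalone demo (illustrative only, not the detector code):

```python
import asyncio


async def watchdog(timeout):
    try:
        await asyncio.sleep(timeout)
    except asyncio.CancelledError:
        # A cancelled sleep is the normal "progress was made" path, so
        # swallow it instead of letting it surface as a task exception.
        return "cancelled"
    return "hang"


async def main():
    task = asyncio.create_task(watchdog(10))
    await asyncio.sleep(0)  # let the watchdog enter its sleep
    task.cancel()
    return await task
```

Because the coroutine suppresses the cancellation and returns normally, awaiting the task yields a value rather than re-raising CancelledError.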

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a34aa63 and 41bf8fe.

📒 Files selected for processing (6)
  • tensorrt_llm/_common.py
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/hang_detector.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/executor/worker.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_common.py
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/hang_detector.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_common.py
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/hang_detector.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (3)
📚 Learning: 2025-09-02T13:42:44.885Z
Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

  • tensorrt_llm/_common.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
  • tensorrt_llm/_torch/pyexecutor/hang_detector.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (5)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/_utils.py (3)
  • mpi_comm (506-507)
  • mpi_rank (540-547)
  • print_all_stacks (766-770)
tensorrt_llm/_common.py (1)
  • _print_stacks (86-92)
tensorrt_llm/_common.py (1)
tensorrt_llm/_utils.py (2)
  • print_all_stacks (766-770)
  • str_dtype_to_trt (247-257)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
tensorrt_llm/_torch/pyexecutor/hang_detector.py (3)
  • HangDetector (9-77)
  • pause (53-57)
  • checkpoint (47-51)
tensorrt_llm/_torch/pyexecutor/hang_detector.py (2)
tensorrt_llm/_utils.py (1)
  • print_all_stacks (766-770)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
  • on_detected (282-286)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/hang_detector.py (5)
  • HangDetector (9-77)
  • start (22-32)
  • detected (42-45)
  • checkpoint (47-51)
  • stop (59-77)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_utils.py (1)

24-27: Stack dumping utility is correct and appropriately scoped

print_all_stacks() cleanly uses sys._current_frames() + traceback.format_stack and logs via the shared logger. No correctness or concurrency issues from this helper; good reuse point for the rest of the PR.

Also applies to: 766-771

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

17-18: ExecutorRequestQueue HangDetector wiring is consistent

Accepting an optional hang_detector and defaulting to HangDetector() when None keeps the queue reusable while allowing PyExecutor to inject a shared instance. Because the queue itself never calls start(), there’s no risk of a hidden background loop; lifecycle remains with the owner.

No changes requested here.

Also applies to: 51-61, 76-78


285-318: Pausing hang detection around queue wait and non-root broadcast is reasonable

Bracketing _get_from_request_queue(timeout) and the non-root _broadcast_new_requests call with pause()/checkpoint() ensures:

  • Long idle periods waiting on new requests don’t trigger false-positive hangs.
  • A fresh timeout window starts after each successful fetch/broadcast.

Given HangDetector is started by PyExecutor and is otherwise inert, this integration looks safe and low-overhead.

Also applies to: 487-495
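
The bracketing described here can be sketched as follows; StubDetector and fetch_requests are illustrative stand-ins, not the real ExecutorRequestQueue code:

```python
import queue


class StubDetector:
    # Records lifecycle calls; stands in for the real HangDetector.
    def __init__(self):
        self.calls = []

    def pause(self):
        self.calls.append("pause")

    def checkpoint(self):
        self.calls.append("checkpoint")


def fetch_requests(request_queue, hang_detector, timeout):
    # An idle wait for new work may legitimately be long, so suspend hang
    # detection before blocking and open a fresh timeout window afterwards.
    hang_detector.pause()
    try:
        items = [request_queue.get(timeout=timeout)]
    except queue.Empty:
        items = []
    hang_detector.checkpoint()
    return items
```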

@tensorrt-cicd

PR_Github #30842 [ run ] triggered by Bot. Commit: 41bf8fe

@tensorrt-cicd

PR_Github #30842 [ run ] completed with state SUCCESS. Commit: 41bf8fe
/LLM/main/L0_MergeRequest_PR pipeline #23817 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuxianq

yuxianq commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #30999 [ run ] triggered by Bot. Commit: 14c6408

@tensorrt-cicd

PR_Github #30999 [ run ] completed with state SUCCESS. Commit: 14c6408
/LLM/main/L0_MergeRequest_PR pipeline #23951 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuxianq

yuxianq commented Jan 9, 2026

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #31176 [ run ] triggered by Bot. Commit: dd7ed8a

@tensorrt-cicd

PR_Github #31176 [ run ] completed with state SUCCESS. Commit: dd7ed8a
/LLM/main/L0_MergeRequest_PR pipeline #24089 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yuxianq

yuxianq commented Jan 10, 2026

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #31303 [ run ] triggered by Bot. Commit: dd7ed8a

@tensorrt-cicd

PR_Github #31303 [ run ] completed with state SUCCESS. Commit: dd7ed8a
/LLM/main/L0_MergeRequest_PR pipeline #24197 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@yuxianq yuxianq requested a review from chzblych January 12, 2026 06:50
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq requested a review from Superjomn January 12, 2026 08:11
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

@Superjomn Superjomn left a comment


LGTM

@yuxianq yuxianq requested a review from Funatiq January 12, 2026 09:37

@Funatiq Funatiq left a comment


IIUC we are not sure when the hangs happen, so this PR adds logging to help with that. I think that's fine to improve debugging.

Would it make sense to change CUDA event syncs to looping event queries to detect hangs? Then we could probably shut down more cleanly. We don't need to do this in this PR.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq requested a review from Funatiq January 12, 2026 10:54
@yuxianq

yuxianq commented Jan 13, 2026

/bot reuse-pipeline

@yuxianq yuxianq enabled auto-merge (squash) January 13, 2026 07:05
@tensorrt-cicd

PR_Github #31727 [ reuse-pipeline ] triggered by Bot. Commit: ad7fb64

@tensorrt-cicd

PR_Github #31727 [ reuse-pipeline ] completed with state SUCCESS. Commit: ad7fb64
Reusing PR_Github #31303 for commit ad7fb64

@yuxianq yuxianq merged commit 04b1126 into NVIDIA:main Jan 13, 2026
5 checks passed
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>