[None][feat] Hang detection for executor loop and worker.#10480

yuxianq · 2026-01-07T05:24:25Z

Description

This PR adds hang detection. All thread stacks are printed, to show which function call is stuck, in the following cases:

The default, overlap, or pp executor loop cannot finish one iteration within the timeout (300s by default).
TRTLLM_WORKER_PRINT_STACKS_PERIOD is set: thread stacks are printed periodically on the workers. This is designed for hang detection during e2e tests.
TRTLLM_PRINT_STACKS_PERIOD is set: thread stacks are printed periodically on the main process. This is designed for hang detection during unit tests.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq · 2026-01-07T05:24:46Z

/bot run --disable-fail-fast

coderabbitai · 2026-01-07T05:31:43Z

📝 Walkthrough

Introduces a hang detection system for TensorRT-LLM executors using an asyncio-based HangDetector that monitors for execution hangs in a background thread. The detector is integrated into PyExecutor and ExecutorRequestQueue, with checkpointing at key execution points. Includes utility functions for periodic stack printing across initialization and worker processes.

Changes

Cohort / File(s)	Change Summary
Hang Detection Core `tensorrt_llm/_torch/pyexecutor/hang_detector.py`	New HangDetector class with asyncio event loop in background thread; monitors for hangs using async tasks with configurable timeout; provides checkpoint/pause/stop lifecycle methods; invokes on_detected callback and prints stack traces when hang detected.
Hang Detection Integration `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`	Adds hang_detection_timeout parameter to PyExecutor constructor; instantiates HangDetector with error-signaling callback; passes detector to ExecutorRequestQueue; inserts checkpoint and stop calls around fetch, broadcast, and synchronization points; checks detected status during shutdown.
Stack Tracing Utilities `tensorrt_llm/_utils.py`, `tensorrt_llm/_common.py`, `tensorrt_llm/executor/worker.py`	Adds print_all_stacks() function using sys._current_frames(); creates background daemon threads for periodic stack printing controlled by TRTLLM_PRINT_STACKS_PERIOD and TRTLLM_WORKER_PRINT_STACKS_PERIOD environment variables; logging integrated at initialization and worker startup.

Sequence Diagram(s)

sequenceDiagram
    participant PyExec as PyExecutor
    participant ReqQueue as ExecutorRequestQueue
    participant Detector as HangDetector
    participant AsyncLoop as Async Event Loop<br/>(Background Thread)
    participant Callback as on_detected<br/>Callback

    Note over PyExec,Callback: Hang Detection Flow

    PyExec->>Detector: start() on warmup
    Detector->>AsyncLoop: Create event loop in daemon thread

    loop Each Execution Iteration
        PyExec->>ReqQueue: fetch_requests()
        ReqQueue->>Detector: checkpoint()
        Detector->>AsyncLoop: schedule _detect_hang() task
        AsyncLoop->>AsyncLoop: async sleep(timeout)

        alt Execution Completes in Time
            ReqQueue-->>PyExec: requests ready
            PyExec->>Detector: checkpoint() at next point
            Detector->>AsyncLoop: cancel prior task, schedule new
        else Timeout Expires
            AsyncLoop->>AsyncLoop: mark hang detected
            AsyncLoop->>AsyncLoop: log error, print stacks
            AsyncLoop->>Callback: invoke callback
            Callback->>PyExec: signal error, set shutdown
            PyExec->>PyExec: early exit, skip worker wait
        end
    end

    PyExec->>Detector: stop() on shutdown
    Detector->>AsyncLoop: cancel tasks, stop loop, join thread
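
The flow in the diagram can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the HangDetector added by this PR: the class name SketchHangDetector, its private helpers, and the choice to cancel and await the pending task inside stop() are all hypothetical. Only the method names start/checkpoint/pause/stop/detected and the on_detected callback come from the walkthrough above.

```python
import asyncio
import threading


class SketchHangDetector:
    """Illustrative watchdog only; not the HangDetector from this PR."""

    def __init__(self, timeout=300.0, on_detected=None):
        self.timeout = timeout
        self.on_detected = on_detected or (lambda: None)
        self._detected = threading.Event()
        self._loop = None
        self._thread = None
        self._task = None  # only touched on the loop thread

    def start(self):
        # Private event loop running in a daemon thread.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever,
                                        daemon=True)
        self._thread.start()

    def detected(self):
        return self._detected.is_set()

    async def _detect_hang(self):
        # Reaching the end of the sleep means no checkpoint arrived in time.
        await asyncio.sleep(self.timeout)
        self._detected.set()
        self.on_detected()

    def _rearm(self, arm):
        # Cancel the previous timer and optionally schedule a fresh one.
        if self._task is not None:
            self._task.cancel()
            self._task = None
        if arm:
            self._task = self._loop.create_task(self._detect_hang())

    def checkpoint(self):
        self._loop.call_soon_threadsafe(self._rearm, True)

    def pause(self):
        self._loop.call_soon_threadsafe(self._rearm, False)

    async def _cancel_pending(self):
        # Cancel the active timer and wait until the cancellation lands.
        task, self._task = self._task, None
        if task is not None:
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)

    def stop(self):
        if self._loop is None:
            return  # never started, or already stopped
        asyncio.run_coroutine_threadsafe(self._cancel_pending(),
                                         self._loop).result()
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
        self._loop = None
```

In this sketch stop() joins the loop thread, so it must not be called from inside on_detected, which runs on that same thread.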

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 34.62% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	⚠️ Warning	PR description is incomplete. The Description section lacks detail about implementation, and Test Coverage section is entirely empty.	Add implementation details to Description explaining how hang detection works, and provide specific test cases in Test Coverage section to validate the hang detection functionality.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main feature being added: hang detection for executor loop and worker, following the template format correctly.


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/_common.py (1)

1-14: Update copyright header year to reflect latest modification

This file now has new runtime behavior but the SPDX copyright line still ends at 2024. Per the project guidelines (“year of latest meaningful modification”), please bump the final year (e.g., to 2025/2026) so the header matches current changes.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

373-384: Checkpoint/stop usage is good; ensure HangDetector is stopped on all exit paths

The new checkpoint() calls at the top of each executor loop and the stop() calls when should_stop_processing or scheduled_batch is None are well placed and give you a clear notion of “no progress for N seconds” during normal operation.

Two lifecycle gaps to address:

Stop detector on exceptional exits

_event_loop_wrapper wraps self.event_loop() and always invokes _executor_loop_cleanup() in finally, but it never stops the HangDetector. If self.event_loop() exits early due to an exception before it reaches the hang_detector.stop() sites in the loops, the detector’s loop thread will keep running indefinitely for the lifetime of the process.

You can make cleanup robust by always stopping the detector in the wrapper.

Proposed fix in _event_loop_wrapper:

def _event_loop_wrapper(self):
    try:
        with customized_gc_thresholds(
                self.garbage_collection_gen0_threshold):
            self.event_loop()
    except Exception as e:
        logger.error(f"Error in event loop: {e}")
        logger.error(traceback.format_exc())
        raise e
    finally:
        # Ensure hang detector loop is torn down even on exceptions.
        self.hang_detector.stop()
        self._executor_loop_cleanup()

`stop()` is idempotent relative to the explicit calls in the happy-path breaks.

Shutdown path after a detected hang

In shutdown(), you enqueue_shutdown_request() and wait() on shutdown_event, then:

if self.hang_detector.detected():
    return
self.worker_thread.join()

Skipping join() when a hang is detected makes sense to avoid blocking forever, but it also means hang_detector.stop() is never invoked in that scenario unless it has already been called from inside the loop. With the change above in _event_loop_wrapper, you’ll ensure the detector is torn down even when the event loop exits via an internal error, and the early-return here becomes purely about not joining a wedged worker.

Overall, the hang-detection checkpoints/stop calls in the loops look good; tightening the cleanup as above will avoid leaving detector threads alive after error or hang conditions.

Also applies to: 397-405, 483-491, 969-988, 1345-1363, 1548-1568

🤖 Fix all issues with AI agents

In @tensorrt_llm/_torch/pyexecutor/hang_detector.py:
- Around line 1-21: Add the standard NVIDIA SPDX/copyright header at the top of
this file (matching the header used in other modules like _utils.py, updating
the year range to include this change), and change the timeout assignment in
HangDetector.__init__ so that only timeout is None triggers the default; replace
"self.timeout = timeout or 300" with an explicit None check (e.g., if timeout is
None: self.timeout = 300 else: self.timeout = timeout) so callers can pass 0 or
negative values intentionally and PyExecutor's hang_detection_timeout=None
remains unambiguous; update any type hints/docstring if needed to reflect the
semantics.

In @tensorrt_llm/_torch/pyexecutor/py_executor.py:
- Around line 49-50: The HangDetector integration currently starts monitoring
when hang_detection_timeout is None and runs heavy cleanup from the detector
thread; change PyExecutor so that HangDetector is only constructed/started when
hang_detection_timeout is not None (making None mean “disabled”), and modify the
on_detected callback in PyExecutor to perform only minimal, thread-safe
signaling: set self.is_shutdown = True, call self.shutdown_event.set(), and log
an error; remove direct calls to self._handle_errors(...) or heavy resource
mutations from the on_detected callback so the main thread or executor shutdown
path observes is_shutdown/shutdown_event and performs _handle_errors and full
cleanup under the normal threading/event-loop assumptions (apply to on_detected
usages and construction sites involving HangDetector, hang_detection_timeout,
_handle_errors, shutdown_event, and is_shutdown).
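
The two requests above can be illustrated in isolation. The sketch below is an assumption-laden example, not code from the PR: resolve_timeout and ShutdownSignal are hypothetical names, and ShutdownSignal only stands in for the is_shutdown and shutdown_event attributes that the comment mentions on PyExecutor.

```python
import threading

DEFAULT_TIMEOUT_S = 300  # default stated in the PR description


def resolve_timeout(timeout):
    """Fall back to the default only when no timeout was given.

    `timeout or 300` would also replace an explicit 0, because 0 is falsy.
    """
    return DEFAULT_TIMEOUT_S if timeout is None else timeout


class ShutdownSignal:
    """Hypothetical stand-in for the executor state the callback touches."""

    def __init__(self):
        self.is_shutdown = False
        self.shutdown_event = threading.Event()

    def on_detected(self):
        # Runs on the detector thread: only flip flags here and leave
        # error handling and cleanup to the thread that owns the resources.
        self.is_shutdown = True
        self.shutdown_event.set()
```

With this shape, a thread blocked in shutdown_event.wait() wakes up as soon as the detector fires and can then run the heavier cleanup itself.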

🧹 Nitpick comments (3)

tensorrt_llm/executor/worker.py (1)

3-4: Worker stack-printing thread is fine; minor robustness/DRY opportunities

The periodic stack-printing daemon in worker_main is functionally sound and mirrors the pattern used in _common._init, and being a daemon thread avoids shutdown blocking.

Two small suggestions:

print_stacks_period = int(os.getenv("TRTLLM_WORKER_PRINT_STACKS_PERIOD", "-1")) will raise ValueError if the env var is set to a non-integer; if misconfiguration is expected in the wild, consider wrapping this in a try/except with a safe fallback.

The _print_stacks loop is now duplicated between _common._init and worker_main; if this pattern grows, consider a shared helper (e.g., start_stack_printer_thread(env_var_name: str, label: str)) in _utils to avoid divergence.

Also applies to: 14-15, 158-172
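
A shared helper along the lines suggested here might look like the sketch below. It is only an illustration: the function name start_stack_printer_thread comes from the suggestion above, while read_period, the extra print_fn parameter, and the choice to disable the feature on a malformed value are assumptions made for this example.

```python
import os
import threading
import time


def read_period(env_var_name, default=-1):
    """Parse an integer period in seconds; malformed values disable it."""
    raw = os.getenv(env_var_name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def start_stack_printer_thread(env_var_name, label, print_fn):
    """Start a daemon thread calling print_fn(label) every period seconds.

    Returns the thread, or None when the period is unset or not positive.
    """
    period = read_period(env_var_name)
    if period <= 0:
        return None

    def _print_stacks():
        while True:
            time.sleep(period)
            print_fn(label)

    thread = threading.Thread(target=_print_stacks, daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, it does not block interpreter shutdown, which matches the behavior described for the existing code.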

tensorrt_llm/_common.py (1)

19-20: Library-wide stack-printing daemon is fine; same minor nits as worker

The _print_stacks helper and associated daemon thread in _init correctly gate on TRTLLM_PRINT_STACKS_PERIOD and reuse print_all_stacks(). Behavior is appropriate for low-frequency diagnostics and won’t block shutdown because the thread is daemonized.

Minor suggestions (same as in worker.py):

print_stacks_period = int(os.getenv("TRTLLM_PRINT_STACKS_PERIOD", "-1")) will raise if the env var is non-numeric; consider a small try/except ValueError with a warning and disabling the feature instead of aborting init.

_print_stacks is now duplicated between _common._init and worker_main; consider a shared helper to centralize this behavior if you plan to evolve it further.

Also applies to: 38-39, 86-97

tensorrt_llm/_torch/pyexecutor/hang_detector.py (1)

22-77: HangDetector core logic is sound; consider minor robustness tweaks

The overall design (own asyncio loop in a dedicated daemon thread, checkpoint() reset, pause() for cancel, stop() to cancel all tasks and stop the loop) is reasonable and low overhead.

A couple of small improvements you might consider:

In _detect_hang, cancellations from pause() will surface as CancelledError from asyncio.sleep. Explicitly catching asyncio.CancelledError and returning early can avoid any chance of noisy task-exception logs on some Python versions.

If there’s a risk of start() being called twice, you may want to guard against reinitializing loop/loop_thread while an old loop is still alive (e.g., by no-op’ing or raising if self.loop is not None).

These are non-blocking, but they’ll make the detector a bit more bulletproof across executor lifecycle edge cases.
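
The first tweak can be shown with plain asyncio. This is a sketch of the suggested pattern, not the PR's _detect_hang: the free function detect_hang and its parameters are hypothetical. Note that swallowing CancelledError makes the task finish normally rather than in the cancelled state, which is acceptable here only because nothing awaits the timer for its cancellation status.

```python
import asyncio


async def detect_hang(timeout, on_detected):
    """Call on_detected() unless the timer is cancelled first."""
    try:
        await asyncio.sleep(timeout)
    except asyncio.CancelledError:
        # A checkpoint or pause cancelled the timer: not a hang, exit quietly.
        return
    on_detected()
```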

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a34aa63 and 41bf8fe.

📒 Files selected for processing (6)

tensorrt_llm/_common.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/hang_detector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_utils.py
tensorrt_llm/executor/worker.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

tensorrt_llm/executor/worker.py
tensorrt_llm/_common.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/hang_detector.py
tensorrt_llm/_utils.py
tensorrt_llm/_torch/pyexecutor/py_executor.py

**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

tensorrt_llm/executor/worker.py
tensorrt_llm/_common.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/hang_detector.py
tensorrt_llm/_utils.py
tensorrt_llm/_torch/pyexecutor/py_executor.py

🧠 Learnings (3)

📚 Learning: 2025-09-02T13:42:44.885Z

Learnt from: pcastonguay
Repo: NVIDIA/TensorRT-LLM PR: 7455
File: tensorrt_llm/_torch/pyexecutor/py_executor.py:1852-1860
Timestamp: 2025-09-02T13:42:44.885Z
Learning: In MPI communication within TensorRT-LLM pipeline parallelism, different communication types (tokens, logits, termination sync) must use disjoint tag namespaces to avoid message routing collisions when using the same source/destination patterns.

Applied to files:

tensorrt_llm/_common.py

📚 Learning: 2025-12-12T03:27:08.565Z

Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.

Applied to files:

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
tensorrt_llm/_torch/pyexecutor/hang_detector.py
tensorrt_llm/_torch/pyexecutor/py_executor.py

📚 Learning: 2025-08-19T12:45:11.997Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

tensorrt_llm/_torch/pyexecutor/py_executor.py

🧬 Code graph analysis (5)

tensorrt_llm/executor/worker.py (2)

tensorrt_llm/_utils.py (3)

mpi_comm (506-507)

mpi_rank (540-547)

print_all_stacks (766-770)

tensorrt_llm/_common.py (1)

_print_stacks (86-92)

tensorrt_llm/_common.py (1)

tensorrt_llm/_utils.py (2)

print_all_stacks (766-770)

str_dtype_to_trt (247-257)

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

tensorrt_llm/_torch/pyexecutor/hang_detector.py (3)

HangDetector (9-77)

pause (53-57)

checkpoint (47-51)

tensorrt_llm/_torch/pyexecutor/hang_detector.py (2)

tensorrt_llm/_utils.py (1)

print_all_stacks (766-770)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

on_detected (282-286)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

tensorrt_llm/_torch/pyexecutor/hang_detector.py (5)

HangDetector (9-77)

start (22-32)

detected (42-45)

checkpoint (47-51)

stop (59-77)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (3)

tensorrt_llm/_utils.py (1)

24-27: Stack dumping utility is correct and appropriately scoped

print_all_stacks() cleanly uses sys._current_frames() + traceback.format_stack and logs via the shared logger. No correctness or concurrency issues from this helper; good reuse point for the rest of the PR.

Also applies to: 766-771
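
For readers unfamiliar with the two calls named above, here is a minimal sketch of a stack dump built on sys._current_frames() and traceback.format_stack. It is not the print_all_stacks() from this PR: the name format_all_stacks, the thread-name lookup, and returning a string instead of logging are assumptions for illustration.

```python
import sys
import threading
import traceback


def format_all_stacks():
    """Return the current stack of every Python thread as one string."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        stack = "".join(traceback.format_stack(frame))
        chunks.append(f"Thread {name} ({thread_id}):\n{stack}")
    return "\n".join(chunks)
```

The calling thread always appears in the result, with format_all_stacks itself as its innermost frame.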

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (2)

17-18: ExecutorRequestQueue HangDetector wiring is consistent

Accepting an optional hang_detector and defaulting to HangDetector() when None keeps the queue reusable while allowing PyExecutor to inject a shared instance. Because the queue itself never calls start(), there’s no risk of a hidden background loop; lifecycle remains with the owner.

No changes requested here.

Also applies to: 51-61, 76-78

285-318: Pausing hang detection around queue wait and non-root broadcast is reasonable

Bracketing _get_from_request_queue(timeout) and the non-root _broadcast_new_requests call with pause()/checkpoint() ensures:

Long idle periods waiting on new requests don’t trigger false-positive hangs.

A fresh timeout window starts after each successful fetch/broadcast.

Given HangDetector is started by PyExecutor and is otherwise inert, this integration looks safe and low-overhead.

Also applies to: 487-495
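
The bracketing pattern described above can be sketched as follows. RecordingDetector and fetch_with_pause are hypothetical names used only for this example; the real queue code in the PR may differ in where exactly it pauses and re-arms.

```python
import queue


class RecordingDetector:
    """Hypothetical stand-in that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def pause(self):
        self.calls.append("pause")

    def checkpoint(self):
        self.calls.append("checkpoint")


def fetch_with_pause(request_queue, detector, timeout):
    """Wait for one item without counting idle time as a hang."""
    detector.pause()  # idle waiting must not trip the watchdog
    try:
        return request_queue.get(timeout=timeout)
    except queue.Empty:
        return None
    finally:
        detector.checkpoint()  # start a fresh timeout window
```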

tensorrt-cicd · 2026-01-07T05:34:50Z

PR_Github #30842 [ run ] triggered by Bot. Commit: 41bf8fe

tensorrt-cicd · 2026-01-07T14:38:12Z

PR_Github #30842 [ run ] completed with state SUCCESS. Commit: 41bf8fe
/LLM/main/L0_MergeRequest_PR pipeline #23817 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

yuxianq · 2026-01-08T05:07:34Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-01-08T05:14:41Z

PR_Github #30999 [ run ] triggered by Bot. Commit: 14c6408

tensorrt-cicd · 2026-01-08T12:51:08Z

PR_Github #30999 [ run ] completed with state SUCCESS. Commit: 14c6408
/LLM/main/L0_MergeRequest_PR pipeline #23951 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

yuxianq · 2026-01-09T03:01:10Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-01-09T03:17:06Z

PR_Github #31176 [ run ] triggered by Bot. Commit: dd7ed8a

tensorrt-cicd · 2026-01-09T23:30:50Z

PR_Github #31176 [ run ] completed with state SUCCESS. Commit: dd7ed8a
/LLM/main/L0_MergeRequest_PR pipeline #24089 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

yuxianq · 2026-01-10T00:50:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-01-10T00:55:46Z

PR_Github #31303 [ run ] triggered by Bot. Commit: dd7ed8a

tensorrt-cicd · 2026-01-10T10:41:17Z

PR_Github #31303 [ run ] completed with state SUCCESS. Commit: dd7ed8a
/LLM/main/L0_MergeRequest_PR pipeline #24197 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Superjomn

LGTM

Funatiq

IIUC we are not sure when the hangs happen, so this PR adds logging to help with that. I think that's fine to improve debugging.

Would it make sense to change CUDA event syncs to looping event queries to detect hangs? Then we could probably shut down more cleanly. We don't need to do this in this PR.
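
As a rough illustration of the idea of replacing a blocking sync with looping queries, the sketch below polls a completion predicate until a deadline. It is not part of this PR. The name wait_until_done and its parameters are hypothetical; for a CUDA event the predicate could be the bound method torch.cuda.Event.query, which reports whether the recorded work has finished, but that wiring is an assumption and is not exercised here.

```python
import time


def wait_until_done(query, timeout_s, poll_interval_s=0.001):
    """Poll query() until it returns True or timeout_s elapses.

    Returns True on completion and False on timeout, so the caller can
    decide how to shut down instead of blocking forever.
    """
    deadline = time.monotonic() + timeout_s
    while not query():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
    return True
```

The trade-off is a small polling latency and some CPU wake-ups in exchange for a bounded wait.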

yuxianq · 2026-01-13T07:04:38Z

/bot reuse-pipeline

tensorrt-cicd · 2026-01-13T07:10:13Z

PR_Github #31727 [ reuse-pipeline ] triggered by Bot. Commit: ad7fb64

tensorrt-cicd · 2026-01-13T07:34:19Z

PR_Github #31727 [ reuse-pipeline ] completed with state SUCCESS. Commit: ad7fb64
Reusing PR_Github #31303 for commit ad7fb64

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>

hang detection for executor loop and worker.

41bf8fe

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq requested review from QiJune and Superjomn January 7, 2026 05:24

yuxianq requested review from a team as code owners January 7, 2026 05:24

coderabbitai bot reviewed Jan 7, 2026

tensorrt_llm/_torch/pyexecutor/hang_detector.py

tensorrt_llm/_torch/pyexecutor/py_executor.py

Merge branch 'main' into hang-detector

14c6408

Merge branch 'main' into hang-detector

dd7ed8a

yuxianq requested a review from chzblych January 12, 2026 06:50

Superjomn reviewed Jan 12, 2026

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (outdated)

tensorrt_llm/_torch/pyexecutor/hang_detector.py (outdated)

yuxianq added 2 commits January 12, 2026 08:07

Address comments.

49dd912

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

Address comment.

4af00fd

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq requested a review from Superjomn January 12, 2026 08:11

Address comment.

2551e0b

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

Superjomn approved these changes Jan 12, 2026

yuxianq requested a review from Funatiq January 12, 2026 09:37

Funatiq reviewed Jan 12, 2026

tensorrt_llm/_torch/pyexecutor/py_executor.py

Address comments.

ad7fb64

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq requested a review from Funatiq January 12, 2026 10:54

Funatiq approved these changes Jan 12, 2026

yuxianq enabled auto-merge (squash) January 13, 2026 07:05

yuxianq merged commit 04b1126 into NVIDIA:main Jan 13, 2026
5 checks passed
