[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns #10988

Merged
lancelly merged 15 commits into NVIDIA:main from lancelly:refactor_request_queue
Feb 2, 2026
Conversation

@lancelly (Collaborator) commented Jan 26, 2026:

Description

Summary

This PR refactors the request fetching and queue management logic, separating responsibilities between ExecutorRequestQueue and PyExecutor. This is the first step of a larger refactoring effort — in follow-up PRs, waiting_queue management will be moved to the new scheduler.

Changes

1. ExecutorRequestQueue — Simplified to pure queue operations

  • Focused on request_queue management only

2. PyExecutor — Takes over request processing orchestration

  • Owns waiting_queue (previously in ExecutorRequestQueue)
  • Restructured _fetch_new_requests as a unified entry point:
    1. _fetch_and_enqueue_requests: fetch + broadcast + enqueue
    2. _select_requests_for_tp / _select_requests_for_dp: select from the waiting queue
    3. _update_new_active_requests_queue_latency: update performance metrics
    4. merge_requests: convert to LlmRequest
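The four-step flow above can be sketched roughly as follows. Only the step and method names come from the PR description; the method bodies, the `_poll_request_queue` helper, and the simplified `_select_requests` stand-in are illustrative assumptions, not the actual implementation.

```python
from collections import deque


class PyExecutorSketch:
    """Hypothetical skeleton of the unified _fetch_new_requests flow."""

    def __init__(self):
        # Waiting queue now owned by PyExecutor (per the PR description).
        self.waiting_queue = deque()

    def _poll_request_queue(self):
        # Stand-in for ExecutorRequestQueue.get_from_request_queue();
        # a real implementation would also broadcast across ranks.
        return []

    def _fetch_and_enqueue_requests(self):
        # Step 1: fetch + broadcast + enqueue into the waiting queue.
        new_items = self._poll_request_queue()
        self.waiting_queue.extend(new_items)
        return new_items

    def _select_requests(self, max_batch):
        # Step 2: simplified stand-in for _select_requests_for_tp /
        # _select_requests_for_dp; pop up to max_batch waiting items.
        selected = []
        while self.waiting_queue and len(selected) < max_batch:
            selected.append(self.waiting_queue.popleft())
        return selected

    def _fetch_new_requests(self, max_batch):
        self._fetch_and_enqueue_requests()           # step 1
        selected = self._select_requests(max_batch)  # step 2
        # Step 3 would update queue-latency metrics here, and
        # step 4 would call merge_requests() to build LlmRequests.
        return selected
```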

3. request_utils.py — Extracted pure functions

  • get_from_waiting_queue
  • merge_requests
  • attach_py_objects_to_requests
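For illustration, a pure function like `get_from_waiting_queue` might look something like this. Only the name comes from the PR description; the signature and exact semantics are assumptions.

```python
from collections import deque


def get_from_waiting_queue(waiting_queue, max_req_count):
    """Pop up to max_req_count items from the front of the waiting queue.

    Pure-function sketch: no executor state is touched beyond the
    queue passed in, which is what makes it easy to unit-test.
    """
    items = []
    while waiting_queue and len(items) < max_req_count:
        items.append(waiting_queue.popleft())
    return items
```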

Next Steps

  • Move waiting_queue management to the new scheduler
  • Refactor IFB schedule flow to:
    1. schedule_local_active — Schedule existing active requests
    2. fetch_new_requests_to_waiting_queue — Fetch and enqueue new requests
    3. assign_waiting_requests — Assign requests from waiting queue
    4. schedule_new_active_requests — Schedule newly activated requests
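Assuming each planned step becomes a method, the target IFB iteration might take a shape like this sketch. Every name other than the four step names above is hypothetical, and the bodies are placeholders.

```python
class IFBSchedulerSketch:
    """Hypothetical shape of the planned IFB schedule flow."""

    def __init__(self):
        self.active_requests = []
        self.waiting_queue = []
        self.incoming = []  # stand-in for the executor request queue

    def schedule_local_active(self):
        # Step 1: schedule requests that are already active.
        return list(self.active_requests)

    def fetch_new_requests_to_waiting_queue(self):
        # Step 2: fetch and enqueue new requests.
        self.waiting_queue.extend(self.incoming)
        self.incoming = []

    def assign_waiting_requests(self, budget):
        # Step 3: activate up to `budget` waiting requests.
        assigned = self.waiting_queue[:budget]
        self.waiting_queue = self.waiting_queue[budget:]
        self.active_requests.extend(assigned)
        return assigned

    def schedule_new_active_requests(self, assigned):
        # Step 4: schedule the newly activated requests.
        return list(assigned)

    def run_iteration(self, budget):
        batch = self.schedule_local_active()                    # step 1
        self.fetch_new_requests_to_waiting_queue()              # step 2
        assigned = self.assign_waiting_requests(budget)         # step 3
        batch += self.schedule_new_active_requests(assigned)    # step 4
        return batch
```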

Summary by CodeRabbit

Release Notes

  • Refactor

    • Simplified request queue architecture and operations for cleaner request handling
    • Restructured request scheduling and distribution workflows across distributed ranks
  • New Features

    • Enhanced request scheduling and balancing across tensor-parallel processing units
    • Improved latency tracking and performance metrics for queued requests
    • New request broadcasting and synchronization capabilities for multi-rank environments
  • Tests

    • Added comprehensive test coverage for request utilities and executor operations


Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly lancelly changed the title from "[chore][None] Refactor request queue" to "[TRTLLM-10666][chore] Refactor request queue" on Jan 26, 2026
@lancelly lancelly changed the title from "[TRTLLM-10666][chore] Refactor request queue" to "[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns" on Jan 26, 2026
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #33557 [ run ] triggered by Bot. Commit: 91d0ccd

@tensorrt-cicd (Collaborator):

PR_Github #33557 [ run ] completed with state SUCCESS. Commit: 91d0ccd
/LLM/main/L0_MergeRequest_PR pipeline #25884 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #33586 [ run ] triggered by Bot. Commit: 91d0ccd

@tensorrt-cicd (Collaborator):

PR_Github #33586 [ run ] completed with state FAILURE. Commit: 91d0ccd
/LLM/main/L0_MergeRequest_PR pipeline #25907 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #33596 [ run ] triggered by Bot. Commit: 91d0ccd

@tensorrt-cicd (Collaborator):

PR_Github #33596 [ run ] completed with state SUCCESS. Commit: 91d0ccd
/LLM/main/L0_MergeRequest_PR pipeline #25916 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #33644 [ run ] triggered by Bot. Commit: 91d0ccd

@tensorrt-cicd (Collaborator):

PR_Github #33644 [ run ] completed with state SUCCESS. Commit: 91d0ccd
/LLM/main/L0_MergeRequest_PR pipeline #25956 completed with status: 'SUCCESS'

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #33678 [ run ] triggered by Bot. Commit: 8cedd88

@lancelly lancelly marked this pull request as ready for review January 27, 2026 07:33
@lancelly lancelly requested a review from a team as a code owner January 27, 2026 07:33
@lancelly lancelly requested a review from Naveassaf January 27, 2026 07:33
@coderabbitai (Contributor) commented Jan 27, 2026:

📝 Walkthrough

Walkthrough

A major refactoring of the executor request queue system, decoupling complex request handling logic from ExecutorRequestQueue into a new standalone utility module. The ExecutorRequestQueue class is significantly simplified to focus on basic queue operations, while PyExecutor now orchestrates request fetching, validation, and broadcasting through a waiting queue workflow. A new request_utils module provides distributed scheduling, context partitioning, and request merging functionality for various CP/PP/TP configurations.

Changes

ExecutorRequestQueue Simplification
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
Removed internal data structures for MPI/PP handling, waiting queues, and complex scheduling. Removed constructor parameters enable_attention_dp and max_num_active_requests. Renamed internal _get_from_request_queue() to public get_from_request_queue(). Added public accessors: get_request_queue_size(), get_request_queue(), and calculate_queue_latency(). Delegated child request ID generation to external get_num_child_requests() utility.
PyExecutor Request Workflow
tensorrt_llm/_torch/pyexecutor/py_executor.py
Added new request fetching and enqueuing layer with waiting_queue deque to hold pending requests before scheduling. Introduced methods: _fetch_and_enqueue_requests(), _pop_from_waiting_queue(), _fetch_new_requests(), _schedule_attention_dp_requests(), _handle_special_queue_items(), _update_new_active_requests_queue_latency(), _get_new_active_requests_queue_latency(), _should_exclude_last_generation_logits(). Added internal state trackers: canceled_req_ids, control_requests, request_accumulated, new_active_requests_queue_latency_ms, _disable_mpi, request_broadcaster. Updated control flow to route through new waiting queue.
Request Utilities (New Module)
tensorrt_llm/_torch/pyexecutor/request_utils.py
New comprehensive module for distributed request processing with utilities for: computing child request counts, collecting/attaching Python objects, scheduling attention DP requests with heap-based balancing, context partitioning for Star Attention and Helix CP modes, merging requests into LlmRequest objects, and a RequestBroadcaster class for distributed broadcasting with hang-detection support.
ExecutorRequestQueue Test Updates
tests/unittest/_torch/executor/test_executor_request_queue.py
Updated tests to reflect simplified constructor signature (removed enable_attention_dp, max_num_active_requests parameters) and public API rename (get_from_request_queue()). Removed tests for legacy internal queue structures and specialized behaviors.
PyExecutor Test Suite (New)
tests/unittest/_torch/executor/test_py_executor.py
Added MockPyExecutor class and tests for: _handle_special_queue_items() behavior with shutdown/cancel/control/normal requests, update_waiting_queue() pruning logic, canceled request ID tracking, and queue state queries (size, latency, active counts). Includes fixtures for mock distribution and executor.
Request Utils Test Suite (New)
tests/unittest/_torch/executor/test_request_utils.py
Comprehensive test suite covering: context partitioning for Star Attention and Helix, request merging with padding/block handling, attention DP scheduling and filtering, request balancing across ranks with heap-based distribution, and integration tests for complete scheduling workflows.
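The heap-based balancing mentioned in the request_utils summary is a standard least-loaded-bin assignment. A generic sketch follows; the function name and the use of request length as the load metric are assumptions, not the module's actual API.

```python
import heapq


def balance_requests_across_ranks(request_lengths, num_ranks):
    """Assign each request to the currently least-loaded rank."""
    # Heap entries are (current load, rank index), so the heap top
    # is always the least-loaded rank.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    # Place large requests first so per-rank loads even out.
    for req_len in sorted(request_lengths, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(req_len)
        heapq.heappush(heap, (load + req_len, rank))
    return assignment
```

This greedy longest-first strategy is a common heuristic for balancing work across ranks; the real module may weigh requests differently.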

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

  • [TRTLLM-5271][feat] best_of/n for pytorch workflow #5997: Shares code-level connections around child-request generation and executor request-queue logic; this PR refactors/removes internal child-request handling and delegates to request_utils, complementing the prior PR's child-request implementation and propagation.

Suggested reviewers

  • HuiGao-NV
  • Funatiq
  • QiJune
  • litaotju
  • nv-guomingz
  • jaedeok-nvidia
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 61.47%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly describes the main change: refactoring request fetching logic for better separation of concerns between ExecutorRequestQueue and PyExecutor.
  • Description check (✅ Passed): The PR description is comprehensive and well-structured, covering the summary, detailed changes across three files, and clear next steps. It provides sufficient context for reviewers.



@coderabbitai (Contributor) left a comment:

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-11: Add the standard NVIDIA copyright header.

This source file should include the NVIDIA header (2026 as the latest modification year).

As per coding guidelines, please add the standard NVIDIA header above the imports.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

78-80: Use Optional[List[int]] for Python 3.8+ compatibility.

The pipe union syntax (List[int] | None) requires Python 3.10+, but the project targets Python 3.8+. Replace with Optional[List[int]], which is already imported from typing.

Fix
-    def _generate_child_request_ids(
-            self, request: ExecutorRequest) -> List[int] | None:
+    def _generate_child_request_ids(
+            self, request: ExecutorRequest) -> Optional[List[int]]:
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`:
- Around line 1-7: Add the standard NVIDIA copyright/header (with 2026 as the
latest modification year) to the top of the file, placing it above the existing
import block (i.e., above the line that starts with "import dataclasses");
ensure the header matches the project's standard format and includes the
copyright notice, years, and any required license wording.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2201-2210: The new_active_requests_queue_latency_ms metric is
being accumulated forever because _update_new_active_requests_queue_latency uses
+=; change the behavior so the per-iteration latency isn't cumulative: either
assign the computed latency directly to new_active_requests_queue_latency_ms in
_update_new_active_requests_queue_latency (replace += with =) or have
_get_new_active_requests_queue_latency return the current value and reset
new_active_requests_queue_latency_ms to 0 after reading. Locate the methods
_update_new_active_requests_queue_latency and
_get_new_active_requests_queue_latency and adjust the handling of
new_active_requests_queue_latency_ms (and keep the call to
executor_request_queue.calculate_queue_latency(new_requests, now) intact).
- Around line 9-12: Add Deque to the typing import list (alongside Callable,
Dict, Iterable, List, Optional, Tuple, Union) and replace all uses of
subscripted deque[...] with Deque[...] to keep Python 3.8+ compatibility;
specifically change the class attribute annotated with deque[RequestQueueItem]
and all method parameters typed as deque[RequestQueueItem] (references to
RequestQueueItem and any variables like request_queue) to use
Deque[RequestQueueItem] instead.
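The read-and-reset fix described in the first item above can be illustrated with a minimal sketch. The attribute and method names follow the review comment; the surrounding class and the float argument are invented for illustration.

```python
class LatencyMetricSketch:
    """Illustrates per-iteration latency reporting via read-and-reset."""

    def __init__(self):
        self.new_active_requests_queue_latency_ms = 0.0

    def _update_new_active_requests_queue_latency(self, latency_ms):
        # Accumulate within the current iteration only.
        self.new_active_requests_queue_latency_ms += latency_ms

    def _get_new_active_requests_queue_latency(self):
        # Return the accumulated value and reset it, so the metric
        # reflects one iteration instead of growing without bound.
        value = self.new_active_requests_queue_latency_ms
        self.new_active_requests_queue_latency_ms = 0.0
        return value
```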

In `@tensorrt_llm/_torch/pyexecutor/request_utils.py`:
- Around line 100-108: The sorting currently prioritizes relaxed requests
because sorted(new_requests, key=get_relax_value, reverse=True) reverses the key
order; change the sort so non-relaxed requests come first by either removing
reverse=True or inverting the key logic in get_relax_value (e.g., return not
scheduling_params.attention_dp_relax when py_scheduling_params exists); update
the sorted call that references get_relax_value/new_requests accordingly so
behavior matches the comment.
- Around line 542-547: The call to executor_request_to_llm_request is passing
positional args that misalign parameters (exclude_last_generation_logits is
being used as child_req_ids and ctx_blocks_list becomes the exclude flag);
update the call to use keyword args so arguments are explicit—call
executor_request_to_llm_request(req_id=req_id, exe_req=exe_req,
child_req_ids=<appropriate_child_req_ids>,
exclude_last_generation_logits=exclude_last_generation_logits,
ctx_blocks_list=ctx_blocks_list) (replace <appropriate_child_req_ids> with the
actual child request ids variable) so child_req_ids is passed explicitly and
ctx_blocks_list is used for the context blocks.
- Around line 536-541: The code currently uses a print followed by "assert
False, 'invalid context partition'" which can be disabled under -O; replace this
runtime check with raising a proper exception (e.g., ValueError or RuntimeError)
in the same spot to enforce validation regardless of optimization flags: remove
the assert and raise an error including cp_rank, len(ctx_blocks), and
num_blocks_per_rank to preserve the diagnostic info printed earlier. Locate the
failing check in request_utils.py (the print + assert sequence) and change it to
raise a descriptive exception (e.g., raise ValueError(f"invalid context
partition: rank={cp_rank}, len(ctx_blocks)={len(ctx_blocks)},
num_blocks_per_rank={num_blocks_per_rank}")) so invalid partitions cannot
proceed.
- Around line 1-2: Add the standard NVIDIA copyright header (with 2026 as the
latest modification year) immediately above the module docstring in
tensorrt_llm/_torch/pyexecutor/request_utils.py; place the full multiline header
block before the existing first line (the triple-quoted docstring) so the module
starts with the NVIDIA header followed by the current docstring.
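The assert-to-exception change suggested in the request_utils items above might look roughly like this. The validation condition and function signature are assumptions reconstructed from the review comment.

```python
def check_context_partition(cp_rank, ctx_blocks, num_blocks_per_rank):
    """Validate a context partition for one CP rank (sketch).

    Raises ValueError instead of using `assert`, so the check still
    runs when Python is started with -O (which strips assertions).
    """
    if len(ctx_blocks) != num_blocks_per_rank:
        raise ValueError(
            f"invalid context partition: rank={cp_rank}, "
            f"len(ctx_blocks)={len(ctx_blocks)}, "
            f"num_blocks_per_rank={num_blocks_per_rank}")
```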

In `@tests/unittest/_torch/executor/test_executor_request_queue.py`:
- Around line 1-8: Add the standard NVIDIA copyright header (updated to 2026)
above the module docstring at the top of the file; insert the multi-line header
comment block before the existing triple-quoted docstring so the file begins
with the NVIDIA header followed by the module docstring that documents
ExecutorRequestQueue tests.

In `@tests/unittest/_torch/executor/test_request_utils.py`:
- Around line 739-741: The test loops use unused loop variables and an ambiguous
Unicode multiplication sign; update the for-loops to use a single underscore for
unused iterators (replace "for rank in range(4):" / "for i in range(5):" with
"for _ in range(...):" or rename the unused variable to "_") and replace the
Unicode "×" character with a plain ASCII identifier or operator (e.g., "x" or
"*" as appropriate) wherever it appears (notably in the blocks referenced around
the loops and the similar occurrence near the other block at lines ~1060-1062)
so Ruff B007 and RUF003 are resolved.

@tensorrt-cicd (Collaborator):

PR_Github #33678 [ run ] completed with state SUCCESS. Commit: 8cedd88
/LLM/main/L0_MergeRequest_PR pipeline #25981 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #34033 [ run ] triggered by Bot. Commit: 9c8ae7b

@tensorrt-cicd (Collaborator):

PR_Github #34033 [ run ] completed with state SUCCESS. Commit: 9c8ae7b
/LLM/main/L0_MergeRequest_PR pipeline #26258 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #34079 [ run ] triggered by Bot. Commit: 9c8ae7b

@tensorrt-cicd (Collaborator):

PR_Github #34079 [ run ] completed with state SUCCESS. Commit: 9c8ae7b
/LLM/main/L0_MergeRequest_PR pipeline #26296 completed with status: 'SUCCESS'

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #34213 [ run ] triggered by Bot. Commit: 19518b2

@tensorrt-cicd (Collaborator):

PR_Github #34213 [ run ] completed with state ABORTED. Commit: 19518b2
LLM/main/L0_MergeRequest_PR #26400 (Blue Ocean) completed with status: ABORTED

@chzblych (Collaborator):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #34232 [ run ] triggered by Bot. Commit: 19518b2

@tensorrt-cicd (Collaborator):

PR_Github #34232 [ run ] completed with state SUCCESS. Commit: 19518b2
/LLM/main/L0_MergeRequest_PR pipeline #26405 completed with status: 'SUCCESS'

@QiJune (Collaborator) left a comment:

LGTM

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author) commented Feb 1, 2026:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #34362 [ run ] triggered by Bot. Commit: 21742cf

@tensorrt-cicd (Collaborator):

PR_Github #34362 [ run ] completed with state SUCCESS. Commit: 21742cf
/LLM/main/L0_MergeRequest_PR pipeline #26508 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author) commented Feb 1, 2026:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #34375 [ run ] triggered by Bot. Commit: 21742cf

@tensorrt-cicd (Collaborator):

PR_Github #34375 [ run ] completed with state SUCCESS. Commit: 21742cf
/LLM/main/L0_MergeRequest_PR pipeline #26521 completed with status: 'SUCCESS'
Pipeline has performance regression cases. Check the performance regression report for details.

@lancelly lancelly merged commit fef0e4b into NVIDIA:main Feb 2, 2026
5 checks passed
5 participants