[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns #10988
Conversation
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #33557 [ run ] triggered by Bot. Commit:

PR_Github #33557 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33586 [ run ] triggered by Bot. Commit:

PR_Github #33586 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33596 [ run ] triggered by Bot. Commit:

PR_Github #33596 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33644 [ run ] triggered by Bot. Commit:

PR_Github #33644 [ run ] completed with state
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #33678 [ run ] triggered by Bot. Commit:
📝 Walkthrough

A major refactoring of the executor request queue system that decouples complex request-handling logic from `ExecutorRequestQueue` into a new standalone utility module. The `ExecutorRequestQueue` class is significantly simplified to focus on basic queue operations, while `PyExecutor` now orchestrates request fetching, validation, and broadcasting through a waiting-queue workflow. A new `request_utils` module provides distributed scheduling, context partitioning, and request merging functionality for various CP/PP/TP configurations.

Changes
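As a rough illustration of the waiting-queue workflow described in the walkthrough, the fetch step can be sketched as a pure helper. Names and signatures here are assumptions based on the summary, not the actual TensorRT-LLM API:

```python
from collections import deque

# Hypothetical sketch of a pure "fetch from waiting queue" helper of the kind
# the walkthrough describes; the real request_utils module may differ.
def get_from_waiting_queue(waiting_queue, max_req_count):
    """Pop up to max_req_count items from the left of the waiting queue."""
    fetched = []
    while waiting_queue and len(fetched) < max_req_count:
        fetched.append(waiting_queue.popleft())
    return fetched

waiting_queue = deque(["req0", "req1", "req2"])
batch = get_from_waiting_queue(waiting_queue, 2)  # ["req0", "req1"]
```

Keeping the helper free of executor state is what makes it easy to move into the new scheduler later, as the PR description proposes.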
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1-11: Add the standard NVIDIA copyright header.

This source file should include the NVIDIA header (2026 as the latest modification year).
As per coding guidelines, please add the standard NVIDIA header above the imports.
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)
78-80: Use `Optional[List[int]]` for Python 3.8+ compatibility.

The pipe union syntax (`List[int] | None`) requires Python 3.10+, but the project targets Python 3.8+. Replace it with `Optional[List[int]]`, which is already imported from `typing`.

Fix:

```diff
-    def _generate_child_request_ids(
-            self, request: ExecutorRequest) -> List[int] | None:
+    def _generate_child_request_ids(
+            self, request: ExecutorRequest) -> Optional[List[int]]:
```
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`:
- Around line 1-7: Add the standard NVIDIA copyright/header (with 2026 as the
latest modification year) to the top of the file, placing it above the existing
import block (i.e., above the line that starts with "import dataclasses");
ensure the header matches the project's standard format and includes the
copyright notice, years, and any required license wording.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2201-2210: The new_active_requests_queue_latency_ms metric is
being accumulated forever because _update_new_active_requests_queue_latency uses
+=; change the behavior so the per-iteration latency isn't cumulative: either
assign the computed latency directly to new_active_requests_queue_latency_ms in
_update_new_active_requests_queue_latency (replace += with =) or have
_get_new_active_requests_queue_latency return the current value and reset
new_active_requests_queue_latency_ms to 0 after reading. Locate the methods
_update_new_active_requests_queue_latency and
_get_new_active_requests_queue_latency and adjust the handling of
new_active_requests_queue_latency_ms (and keep the call to
executor_request_queue.calculate_queue_latency(new_requests, now) intact).
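The fix described above can be sketched as follows. The method names mirror the review comment, but the container class is a hypothetical stand-in for the real executor:

```python
# Illustrative sketch of the "read-and-reset" option from the review comment.
class QueueLatencyStats:
    def __init__(self):
        self.new_active_requests_queue_latency_ms = 0.0

    def _update_new_active_requests_queue_latency(self, latency_ms: float):
        # Assign instead of '+=' so the metric reflects only this iteration.
        self.new_active_requests_queue_latency_ms = latency_ms

    def _get_new_active_requests_queue_latency(self) -> float:
        # Return the current value and reset it, so repeated reads do not
        # report stale or accumulated latency.
        value = self.new_active_requests_queue_latency_ms
        self.new_active_requests_queue_latency_ms = 0.0
        return value
```

Either the assignment or the read-and-reset alone fixes the unbounded growth; combining both makes the metric robust to callers that skip a read.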
- Around line 9-12: Add Deque to the typing import list (alongside Callable,
Dict, Iterable, List, Optional, Tuple, Union) and replace all uses of
subscripted deque[...] with Deque[...] to keep Python 3.8+ compatibility;
specifically change the class attribute annotated with deque[RequestQueueItem]
and all method parameters typed as deque[RequestQueueItem] (references to
RequestQueueItem and any variables like request_queue) to use
Deque[RequestQueueItem] instead.
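A minimal sketch of the suggested annotation change, with `RequestQueueItem` as a simplified stand-in for the real class:

```python
from collections import deque
from typing import Deque

# Stand-in for the real RequestQueueItem; illustrative only.
class RequestQueueItem:
    pass

# Python 3.8-compatible annotation: typing.Deque instead of the
# subscripted builtin deque[...], which needs Python 3.9+.
request_queue: Deque[RequestQueueItem] = deque()

def drain(queue: Deque[RequestQueueItem]) -> int:
    """Pop every item and return how many were removed."""
    count = 0
    while queue:
        queue.popleft()
        count += 1
    return count
```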
In `@tensorrt_llm/_torch/pyexecutor/request_utils.py`:
- Around line 100-108: The sorting currently prioritizes relaxed requests
because sorted(new_requests, key=get_relax_value, reverse=True) reverses the key
order; change the sort so non-relaxed requests come first by either removing
reverse=True or inverting the key logic in get_relax_value (e.g., return not
scheduling_params.attention_dp_relax when py_scheduling_params exists); update
the sorted call that references get_relax_value/new_requests accordingly so
behavior matches the comment.
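The suggested ordering fix can be sketched like this. The attribute names mirror the review comment; the request objects are simplified stand-ins:

```python
from types import SimpleNamespace

def get_relax_value(request) -> bool:
    params = getattr(request, "py_scheduling_params", None)
    if params is None:
        return False          # no scheduling params: treat as non-relaxed
    return bool(params.attention_dp_relax)

def order_requests(new_requests):
    # False sorts before True, so dropping reverse=True puts
    # non-relaxed requests first, matching the comment's intent.
    return sorted(new_requests, key=get_relax_value)

relaxed = SimpleNamespace(
    py_scheduling_params=SimpleNamespace(attention_dp_relax=True))
strict = SimpleNamespace(py_scheduling_params=None)
ordered = order_requests([relaxed, strict])  # strict first, relaxed last
```

Because `sorted` is stable, requests with the same relax value keep their arrival order.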
- Around line 542-547: The call to executor_request_to_llm_request is passing
positional args that misalign parameters (exclude_last_generation_logits is
being used as child_req_ids and ctx_blocks_list becomes the exclude flag);
update the call to use keyword args so arguments are explicit—call
executor_request_to_llm_request(req_id=req_id, exe_req=exe_req,
child_req_ids=<appropriate_child_req_ids>,
exclude_last_generation_logits=exclude_last_generation_logits,
ctx_blocks_list=ctx_blocks_list) (replace <appropriate_child_req_ids> with the
actual child request ids variable) so child_req_ids is passed explicitly and
ctx_blocks_list is used for the context blocks.
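A sketch of why the keyword-argument form is safer. The signature below is a simplified stand-in for the real `executor_request_to_llm_request`, used only to show the parameter mapping:

```python
# Hypothetical simplified signature; the real helper differs.
def executor_request_to_llm_request(req_id, exe_req, child_req_ids=None,
                                    exclude_last_generation_logits=False,
                                    ctx_blocks_list=None):
    return {
        "req_id": req_id,
        "exe_req": exe_req,
        "child_req_ids": child_req_ids,
        "exclude_last_generation_logits": exclude_last_generation_logits,
        "ctx_blocks_list": ctx_blocks_list,
    }

# Keyword arguments make each value land on the intended parameter,
# avoiding the positional misalignment flagged in the review.
llm_req = executor_request_to_llm_request(
    req_id=1,
    exe_req="exe_req",
    child_req_ids=[2, 3],
    exclude_last_generation_logits=True,
    ctx_blocks_list=[[0, 1]],
)
```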
- Around line 536-541: The code currently uses a print followed by "assert
False, 'invalid context partition'" which can be disabled under -O; replace this
runtime check with raising a proper exception (e.g., ValueError or RuntimeError)
in the same spot to enforce validation regardless of optimization flags: remove
the assert and raise an error including cp_rank, len(ctx_blocks), and
num_blocks_per_rank to preserve the diagnostic info printed earlier. Locate the
failing check in request_utils.py (the print + assert sequence) and change it to
raise a descriptive exception (e.g., raise ValueError(f"invalid context
partition: rank={cp_rank}, len(ctx_blocks)={len(ctx_blocks)},
num_blocks_per_rank={num_blocks_per_rank}")) so invalid partitions cannot
proceed.
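A sketch of replacing the disable-able `assert` with a real exception, as the comment suggests. The validity condition here is a placeholder; the actual check in `request_utils.py` may differ:

```python
def check_context_partition(cp_rank, ctx_blocks, num_blocks_per_rank):
    # Placeholder validity condition for illustration.
    if len(ctx_blocks) > num_blocks_per_rank:
        # Raise instead of `assert False` so the check survives `python -O`,
        # and keep the diagnostic info that was previously printed.
        raise ValueError(
            f"invalid context partition: rank={cp_rank}, "
            f"len(ctx_blocks)={len(ctx_blocks)}, "
            f"num_blocks_per_rank={num_blocks_per_rank}")

try:
    check_context_partition(0, [1, 2, 3], 2)
    raised = False
except ValueError:
    raised = True
```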
- Around line 1-2: Add the standard NVIDIA copyright header (with 2026 as the
latest modification year) immediately above the module docstring in
tensorrt_llm/_torch/pyexecutor/request_utils.py; place the full multiline header
block before the existing first line (the triple-quoted docstring) so the module
starts with the NVIDIA header followed by the current docstring.
In `@tests/unittest/_torch/executor/test_executor_request_queue.py`:
- Around line 1-8: Add the standard NVIDIA copyright header (updated to 2026)
above the module docstring at the top of the file; insert the multi-line header
comment block before the existing triple-quoted docstring so the file begins
with the NVIDIA header followed by the module docstring that documents
ExecutorRequestQueue tests.
In `@tests/unittest/_torch/executor/test_request_utils.py`:
- Around line 739-741: The test loops use unused loop variables and an ambiguous
Unicode multiplication sign; update the for-loops to use a single underscore for
unused iterators (replace "for rank in range(4):" / "for i in range(5):" with
"for _ in range(...):" or rename the unused variable to "_") and replace the
Unicode "×" character with a plain ASCII identifier or operator (e.g., "x" or
"*" as appropriate) wherever it appears (notably in the blocks referenced around
the loops and the similar occurrence near the other block at lines ~1060-1062)
so Ruff B007 and RUF003 are resolved.
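A minimal sketch of both lint fixes: `_` for unused loop variables (Ruff B007) and ASCII `*` instead of the Unicode multiplication sign (RUF003):

```python
# '_' signals the loop variable is intentionally unused (fixes B007);
# the comment uses ASCII '*' rather than Unicode '×' (fixes RUF003).
total_blocks = 0
for _ in range(4):       # was: for rank in range(4)
    for _ in range(5):   # was: for i in range(5)
        total_blocks += 1  # 4 * 5 = 20 blocks
```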
PR_Github #33678 [ run ] completed with state
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
…TensorRT-LLM into refactor_request_queue
/bot run --disable-fail-fast

PR_Github #34033 [ run ] triggered by Bot. Commit:

PR_Github #34033 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34079 [ run ] triggered by Bot. Commit:

PR_Github #34079 [ run ] completed with state
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #34213 [ run ] triggered by Bot. Commit:

PR_Github #34213 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34232 [ run ] triggered by Bot. Commit:

PR_Github #34232 [ run ] completed with state
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #34362 [ run ] triggered by Bot. Commit:

PR_Github #34362 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34375 [ run ] triggered by Bot. Commit:

PR_Github #34375 [ run ] completed with state
Description
Summary
This PR refactors the request fetching and queue management logic, separating responsibilities between `ExecutorRequestQueue` and `PyExecutor`. This is the first step of a larger refactoring effort — in follow-up PRs, `waiting_queue` management will be moved to the new scheduler.

Changes

1. `ExecutorRequestQueue` — Simplified to pure queue operations
   - `request_queue` management only
2. `PyExecutor` — Takes over request processing orchestration
   - `waiting_queue` (previously in `ExecutorRequestQueue`)
   - `_fetch_new_requests` as the unified entry point:
     - `_fetch_and_enqueue_requests`
     - `_select_requests_for_tp` / `_select_requests_for_dp`
     - `_update_new_active_requests_queue_latency`
     - `merge_requests`
3. `request_utils.py` — Extracted pure functions
   - `get_from_waiting_queue`
   - `merge_requests`
   - `attach_py_objects_to_requests`

Next Steps

- Move `waiting_queue` management to the new scheduler
- `schedule_local_active` — Schedule existing active requests
- `fetch_new_requests_to_waiting_queue` — Fetch and enqueue new requests
- `assign_waiting_requests` — Assign requests from waiting queue
- `schedule_new_active_requests` — Schedule newly activated requests

Summary by CodeRabbit
Release Notes
Refactor
New Features
Tests