[None][feat] Visual Gen: add cuda graphs; torch compile; nvtx; warmup #11554
chang-l merged 4 commits into NVIDIA:main from
Conversation
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
📝 Walkthrough

This PR extends the VisualGen pipeline with TorchCompile and CUDA graph acceleration capabilities. New CLI options and configuration fields enable these features across the example scripts. A CUDA graph runner manages per-key graph lifecycles with warm-up routines. The pipeline orchestrates conditional torch compilation, CUDA graph setup, and NVTX profiling instrumentation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Loader as pipeline_loader
    participant Pipeline as BasePipeline
    participant Compiler as torch.compile
    participant CUDA as CUDAGraphRunner
    participant Transformer as Transformer
    participant Inference as Forward Pass
    User->>Loader: Load with config
    Loader->>Pipeline: Initialize pipeline
    alt enable_torch_compile
        Loader->>Pipeline: torch_compile()
        Pipeline->>Compiler: Compile selected modules
        Compiler-->>Pipeline: Compiled modules
    end
    alt enable_cuda_graph
        Loader->>CUDA: Initialize CUDAGraphRunner
        CUDA-->>Pipeline: Graph manager ready
    end
    Loader->>Pipeline: warmup(warmup_steps)
    Pipeline->>Transformer: Run warmup inference
    Transformer-->>Pipeline: Warm-up complete
    alt enable_layerwise_nvtx_marker
        Loader->>Transformer: Register NVTX hooks
    end
    Loader-->>User: Pipeline ready
    User->>Pipeline: forward(prompt, ...)
    Pipeline->>Pipeline: denoise loop (with nvtx_range)
    loop each denoise step
        Pipeline->>CUDA: wrap(denoise_step)
        alt graph exists for key
            CUDA->>CUDA: replay graph
        else new key
            CUDA->>Transformer: capture graph
            Transformer-->>CUDA: graph captured
            CUDA->>CUDA: replay graph
        end
        CUDA-->>Pipeline: step output
    end
    Pipeline-->>User: Generate output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/visual_gen/pipeline.py (1)
1-13: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

This source file should include the NVIDIA Apache 2.0 copyright header per repository coding guidelines. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 1-13: this file is missing the required NVIDIA Apache-2.0 copyright header; add the standard NVIDIA header (including the correct year of latest meaningful modification and the Apache-2.0 license notice) at the top of the module before any imports so it precedes symbols like imports of torch, Mapping, CUDAGraphRunner, and classes/functions in this file (e.g., CUDAGraphRunner, CUDAGraphRunnerConfig, TeaCacheBackend) to comply with repository coding guidelines.

tensorrt_llm/_torch/visual_gen/config.py (1)
195-211: ⚠️ Potential issue | 🟡 Minor

`enable_fullgraph` is declared but never passed to `torch.compile()`.

The config field `enable_fullgraph` (line 201) is defined and exposed as a CLI argument, but `BasePipeline.torch_compile()` in `pipeline.py` (lines 173–178) calls `torch.compile(block, mode=compile_mode, dynamic=False)` without passing the `fullgraph` parameter. The flag is silently ignored.

Fix in `tensorrt_llm/_torch/visual_gen/pipeline.py` `torch_compile()` method:
```diff
+ fullgraph = pipeline_config.enable_fullgraph
  for name in pipeline_config.torch_compile_models:
      ...
      compiled_blocks.append(
-         torch.compile(block, mode=compile_mode, dynamic=False)
+         torch.compile(block, mode=compile_mode, dynamic=False, fullgraph=fullgraph)
      )
      ...
-     compiled = torch.compile(model, mode=compile_mode, dynamic=False)
+     compiled = torch.compile(model, mode=compile_mode, dynamic=False, fullgraph=fullgraph)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/config.py` around lines 195-211: the PipelineConfig flag enable_fullgraph is never forwarded to torch.compile, so update BasePipeline.torch_compile() to pass the fullgraph argument from the config: when calling torch.compile(block, mode=compile_mode, dynamic=False) include fullgraph=self.config.enable_fullgraph (or equivalent access to PipelineConfig) so the user-specified flag is respected; ensure the call site in the BasePipeline.torch_compile method references PipelineConfig.enable_fullgraph and retains existing mode/dynamic behavior.

tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (1)
1-17: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

This source file has no Apache 2.0 / NVIDIA copyright header. Per coding guidelines, all `.py` source files must contain one. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py` around lines 1 - 17, This file is missing the required NVIDIA/Apache-2.0 copyright header; add the standard NVIDIA copyright and Apache-2.0 license header at the very top of tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (above the imports) with the year of the latest meaningful modification, matching the project's header format used in other .py files; ensure the header remains a comment block and does not alter imports or symbols like AutoencoderKLWan, FlowMatchEulerDiscreteScheduler, BasePipeline, register_pipeline, etc.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py (2)
129-136: Redundant `replay()` after `capture()` on first call.

After `capture()` runs the function and stores the output, `replay()` is called immediately with the same inputs — this replays the graph unnecessarily since the output is already stored. You can return `self.graph_outputs[key]` directly.

Suggested change:

```diff
  if key not in self.graphs:
      self.capture(key, fn, args, kwargs)
-     return self.replay(key, args, kwargs)
- else:
-     return self.replay(key, args, kwargs)
+ return self.replay(key, args, kwargs)
```

Or skip the replay entirely on first capture:

```diff
  if key not in self.graphs:
      self.capture(key, fn, args, kwargs)
+     return self.graph_outputs[key]
  else:
      return self.replay(key, args, kwargs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py` around lines 129 - 136, The current flow in get_graph_key/use-site calls capture() then immediately replay() which is redundant; after calling self.capture(key, fn, args, kwargs) return the cached output from self.graph_outputs[key] instead of calling self.replay again. Update the branch that checks if key not in self.graphs to call self.capture(...) and then return self.graph_outputs[key], and simplify/remove the duplicate replay path so only the else path calls self.replay(key, args, kwargs).
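The capture-once / replay-afterwards control flow under discussion can be illustrated CPU-only. This is a hedged sketch: `CaptureCache` and its methods are stand-ins for the real `CUDAGraphRunner`, with a cached output playing the role of a captured graph's static output.

```python
# CPU-only sketch of the capture-once / replay-after pattern the comment
# describes; CaptureCache is illustrative, not the real CUDAGraphRunner.
class CaptureCache:
    def __init__(self):
        self.outputs = {}   # key -> cached output ("captured graph" stand-in)
        self.replays = 0

    def capture(self, key, fn, *args):
        # Run fn once and remember its result, like graph capture stores
        # static outputs.
        self.outputs[key] = fn(*args)

    def replay(self, key):
        self.replays += 1
        return self.outputs[key]

    def run(self, key, fn, *args):
        if key not in self.outputs:
            self.capture(key, fn, *args)
            return self.outputs[key]   # no redundant replay on first call
        return self.replay(key)
```

With this shape, the first call for a key captures and returns directly; only subsequent calls go through `replay`, which mirrors the second suggested diff above.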
77-82: `gc.collect()` + `torch.cuda.empty_cache()` inside the warmup loop is excessive.

These are called every warmup iteration (default 2). Moving them after the loop would reduce overhead while still freeing memory before capture.

Suggested change:

```diff
  for _ in range(self.WARMUP_STEPS):
      fn(*static_args, **static_kwargs)
      torch.cuda.synchronize()
-     gc.collect()
-     torch.cuda.empty_cache()
+
+ gc.collect()
+ torch.cuda.empty_cache()

  with torch.cuda.graph(graph, pool=self.memory_pool):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py` around lines 77-82: the warmup loop currently calls gc.collect() and torch.cuda.empty_cache() every iteration which is unnecessary; instead, keep the loop that calls fn(*static_args, **static_kwargs) and torch.cuda.synchronize() for self.WARMUP_STEPS, then call gc.collect() and torch.cuda.empty_cache() once after the loop and just before creating/using torch.cuda.CUDAGraph() (or before capture) to reduce overhead while still freeing memory; locate the warmup loop that references self.WARMUP_STEPS, fn, and torch.cuda.CUDAGraph() and move the two cleanup calls out of the loop accordingly.

tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (1)
245-278: Warmup implementation runs the full pipeline including text encoding and VAE decode.

This is thorough, but note it requires all standard components (tokenizer, text_encoder, VAE, scheduler) to be loaded, which means warmup will fail if any are in `skip_components`. Consider adding a guard or at least a clear error message if a required component is missing during warmup.

Also, the variant detection via substring match in `checkpoint_path` (lines 260–264) is fragile — a path containing "480P" in a parent directory name could trigger an unintended match.
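One way to harden the variant check is boundary-matching against the checkpoint's final path component only. This is an illustrative sketch, not the pipeline's actual code; the variant names and helper are assumptions.

```python
# Hedged sketch: boundary-matched variant detection on the last path
# component, so "480P" in a parent directory cannot cause a false positive.
import re
from pathlib import Path
from typing import Optional


def detect_variant(checkpoint_path: str,
                   variants=("480P", "720P")) -> Optional[str]:
    # Only inspect the checkpoint's own name, not the whole path.
    name = Path(checkpoint_path).name
    for variant in variants:
        # Require a non-alphanumeric boundary on both sides of the token.
        if re.search(rf"(?<![0-9A-Za-z]){re.escape(variant)}(?![0-9A-Za-z])",
                     name):
            return variant
    return None
```

A plain `"480P" in checkpoint_path` would match `/data/480P-experiments/checkpoint`, while this version does not.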
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py` around lines 245-278: the _run_warmup method currently calls the full forward pipeline which will fail if components are missing; before running warmup verify required components (tokenizer, text_encoder, VAE/decoder, scheduler or whatever forward() needs) are present on self (or not listed in self.model_config.skip_components) and either skip warmup with a clear logger error/warning or raise a descriptive exception; also tighten the variant detection by checking the checkpoint filename or using a regex/boundary match instead of a plain substring on checkpoint_path (refer to checkpoint_path, variant_shapes and the loop that sets height/width/num_frames in _run_warmup) so parent directories containing "480P" don't produce false matches.

tensorrt_llm/_torch/visual_gen/pipeline.py (2)
40-57: CUDA graph + torch.compile mutual exclusion is handled correctly, but the early `return` could be clearer.

The `return` at line 49 exits `__init__` entirely. Currently this only skips the CUDA graph runner setup (lines 51–57), which is the intended behavior. However, if future code is added after this block in `__init__`, it will be silently skipped. Consider restructuring to use `elif` or an early guard that doesn't exit the constructor.

Suggested restructuring:

```diff
  if self.model_config.pipeline.enable_cuda_graph and self.transformer is not None:
      if self.model_config.pipeline.enable_torch_compile:
          logger.warning(
              "Cuda graphs with torch compile is not supported yet. "
              "Only using torch compile for better performance. "
          )
-         return
-
-     self.cuda_graph_runner = CUDAGraphRunner(
-         CUDAGraphRunnerConfig(
-             use_cuda_graph=model_config.pipeline.enable_cuda_graph,
+     else:
+         self.cuda_graph_runner = CUDAGraphRunner(
+             CUDAGraphRunnerConfig(
+                 use_cuda_graph=model_config.pipeline.enable_cuda_graph,
+             )
          )
-     )
-     logger.info("Cuda graph runner enabled, wrapping transformer.forward")
-     self.transformer.forward = self.cuda_graph_runner.wrap(self.transformer.forward)
+         logger.info("Cuda graph runner enabled, wrapping transformer.forward")
+         self.transformer.forward = self.cuda_graph_runner.wrap(self.transformer.forward)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 40 - 57, The early "return" inside the CUDA-graph block exits __init__ entirely and may inadvertently skip later initialization; instead remove the return and restructure the condition so torch.compile branches don't exit the constructor—e.g., change the nested if to an if/elif or add an else before creating CUDAGraphRunner so when self.model_config.pipeline.enable_cuda_graph is true but enable_torch_compile is also true you log the warning but continue __init__; update the block around model_config.pipeline.enable_cuda_graph, enable_torch_compile, CUDAGraphRunner, self.cuda_graph_runner, and self.transformer.forward = self.cuda_graph_runner.wrap(...) accordingly.
185-195: `_find_transformer_blocks` only matches `ModuleList` with `len > 1`.

A model with a single transformer block (e.g., a tiny debug model) would not match, falling through to whole-module compilation. This is probably fine in practice, but worth documenting the threshold choice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 185 - 195, The helper _find_transformer_blocks currently only recognizes nn.ModuleList children with len > 1, which skips models that contain a single transformer block; update the predicate in _find_transformer_blocks (the loop over model.named_children and the isinstance(child, nn.ModuleList) check) to accept ModuleList instances with len >= 1 (or len(child) > 0) so single-block ModuleLists are returned, and update the method docstring to state it returns attribute names containing nn.ModuleList with one or more elements to document the threshold change.tensorrt_llm/_torch/visual_gen/pipeline_loader.py (1)
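The threshold question can be shown with a toy, torch-free illustration. Everything here is a stand-in (`ModuleListLike` for `nn.ModuleList`, `ModelLike` for `nn.Module`), not the repository's actual helper.

```python
# Pure-Python toy illustrating the len >= 1 predicate suggested above.
class ModuleListLike(list):
    """Stand-in for torch.nn.ModuleList in this sketch."""


class ModelLike:
    """Stand-in for torch.nn.Module exposing named_children()."""

    def __init__(self, **children):
        self._children = children

    def named_children(self):
        return self._children.items()


def find_transformer_blocks(model) -> list:
    # Accept containers with one or more blocks (len >= 1) so single-block
    # debug models still get per-block compilation.
    return [
        name for name, child in model.named_children()
        if isinstance(child, ModuleListLike) and len(child) >= 1
    ]
```

With `len(child) > 1` instead, the single-block model below would fall through to whole-module compilation.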
218-223: Import style: prefer importing the module, not the class directly.

Per coding guidelines, use `from tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker` rather than importing `LayerwiseNvtxMarker` directly.

Suggested change:

```diff
  if config.pipeline.enable_layerwise_nvtx_marker:
-     from tensorrt_llm._torch.pyexecutor.layerwise_nvtx_marker import LayerwiseNvtxMarker
+     from tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker
-     marker = LayerwiseNvtxMarker()
+     marker = layerwise_nvtx_marker.LayerwiseNvtxMarker()
      module_prefix = pipeline.__class__.__name__
      marker.register_hooks(pipeline.transformer, module_prefix)
```

As per coding guidelines: "Python imports must use `from package.subpackage import module` style; never use `from module import Class`."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/pipeline_loader.py` around lines 218 - 223, Replace the direct class import with a module-level import and instantiate the class via the module: change the import to use "from tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker", then create the marker with "layerwise_nvtx_marker.LayerwiseNvtxMarker()" and keep the subsequent calls (marker.register_hooks(pipeline.transformer, module_prefix)) and the surrounding conditional (config.pipeline.enable_layerwise_nvtx_marker) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/visual_gen/visual_gen_wan_i2v.py`:
- Around line 199-207: The code references args.enable_cudagraph in the pipeline
config but parse_args() never defines that CLI flag, causing AttributeError; add
a new argument definition in parse_args() (similar to visual_gen_wan_t2v.py)
such as adding parser.add_argument("--enable_cudagraph", action="store_true",
default=False, help="enable CUDA graph usage") (place it near the other pipeline
flags, e.g., before the --disable_torch_compile block) so args.enable_cudagraph
is available when building the "pipeline" dict.
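The fix described above is a standard argparse boolean flag. A minimal self-contained sketch (the flag name comes from the review; the surrounding parser setup in the example script is assumed):

```python
# Minimal argparse sketch of the missing --enable_cudagraph flag.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="visual_gen example (sketch)")
    parser.add_argument(
        "--enable_cudagraph",
        action="store_true",
        default=False,
        help="enable CUDA graph usage",
    )
    return parser.parse_args(argv)
```

With this in place, `args.enable_cudagraph` defaults to `False` and flips to `True` when the flag is passed, so the pipeline dict no longer raises `AttributeError`.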
In `@tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py`:
- Around line 194-209: The method _concat_qkv is unconditionally decorated with
`@torch.compile` which forces compilation regardless of pipeline config; change it
to follow the project's pattern and respect the config by making the decorator
conditional (e.g., use `@torch.compile`(disable=not _is_torch_compile()) or
`@torch.compile`(disable=not enable_torch_compile)) or remove the decorator
entirely and rely on BasePipeline.torch_compile() block-level compilation;
update the decorator on _concat_qkv accordingly so it no longer bypasses the
config-controlled torch.compile behavior.
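The config-conditional pattern suggested here can be sketched without torch. `maybe_compile` and `_COMPILE_ENABLED` are illustrative names, and `transform` plays the role of `torch.compile` (whose real API does accept a `disable=` keyword for the same purpose).

```python
# Pure-Python sketch of config-conditional compilation: the decorator is a
# no-op unless compilation is enabled in config.
_COMPILE_ENABLED = False  # would come from the pipeline config in practice


def maybe_compile(transform):
    """Apply `transform` to the function only when compilation is enabled."""
    def decorator(fn):
        if _COMPILE_ENABLED:
            return transform(fn)
        return fn  # leave the function untouched when disabled
    return decorator


@maybe_compile(transform=lambda fn: fn)  # no-op transform for the sketch
def concat_qkv_sketch(q, k, v):
    return q + k + v
```

An unconditional `@torch.compile` on `_concat_qkv` forces compilation even when the pipeline config disables it; gating the decorator like this keeps the config in control.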
In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py`:
- Around line 55-64: get_graph_key currently only uses tensor shapes so
non-tensor arguments (ints, bools, strings, enums, etc.) are ignored and
different values will reuse the wrong CUDA graph; update get_graph_key to
include non-tensor args as part of the returned key by extending the existing
tuple with a stable, hashable representation of each non-tensor positional arg
and each non-tensor kwarg (preserve kwarg key order by sorting keys, e.g.,
include (k, value_repr) for kwargs); keep using tuple(shape) for torch.Tensor
inputs but for non-tensors use a deterministic representation such as
(type_name, repr(value)) or a safe serialization (pickle or json for primitives)
to avoid unhashable objects, then return the combined tuple as the KeyType so
graphs are keyed by both tensor shapes and non-tensor argument values.
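A shape of the keying function described above can be sketched in pure Python; `get_graph_key_sketch` and `FakeTensor` are illustrative stand-ins (anything with a `.shape` attribute plays the role of `torch.Tensor`).

```python
# Sketch of a graph key covering non-tensor args too: tensors contribute
# their shapes, everything else contributes a deterministic hashable repr.
def get_graph_key_sketch(args, kwargs):
    parts = []
    for a in args:
        if hasattr(a, "shape"):
            parts.append(("tensor", tuple(a.shape)))
        else:
            # Stable, hashable stand-in for non-tensor values.
            parts.append((type(a).__name__, repr(a)))
    for k in sorted(kwargs):  # sort so kwarg order cannot change the key
        v = kwargs[k]
        if hasattr(v, "shape"):
            parts.append((k, "tensor", tuple(v.shape)))
        else:
            parts.append((k, type(v).__name__, repr(v)))
    return tuple(parts)


class FakeTensor:
    """Stand-in for torch.Tensor in this sketch."""

    def __init__(self, *shape):
        self.shape = shape
```

With shapes-only keying, two calls differing only in a boolean or step count would wrongly replay the same graph; including non-tensor values in the key prevents that.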
- Around line 1-11: Add the required NVIDIA Apache-2.0 copyright header to the
top of the new module tensorrt_llm._torch.visual_gen.cuda_graph_runner (i.e.,
prepend the standard NVIDIA file header with copyright year and license text
before any imports), ensuring the year is correct and the header matches other
repository files.
In `@tensorrt_llm/_torch/visual_gen/pipeline.py`:
- Around line 146-183: The torch_compile method ignores
pipeline_config.enable_fullgraph; update both places that call torch.compile in
torch_compile (the per-block compile inside the loop that builds compiled_blocks
and the whole-module compile path that sets compiled) to pass
fullgraph=pipeline_config.enable_fullgraph in addition to mode=compile_mode and
dynamic=False so the config option is respected; reference the torch_compile
method, pipeline_config.torch_compile_mode, pipeline_config.enable_fullgraph,
and the per-block loop that constructs compiled_blocks and the else branch that
sets self.<name>=compiled to locate the two call sites.
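The flag-threading fix can be shown with a small stand-alone sketch. Names are assumptions mirroring the review (`PipelineConfigSketch`, `compile_blocks`), and `compile_fn` stands in for `torch.compile` so the sketch stays torch-free.

```python
# Illustrative sketch: thread a config flag through to the compile call so
# it is never silently dropped. compile_fn stands in for torch.compile.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineConfigSketch:
    enable_fullgraph: bool = False
    torch_compile_mode: str = "default"


def compile_blocks(blocks: List[Callable], cfg: PipelineConfigSketch,
                   compile_fn: Callable) -> list:
    # Forwarding fullgraph alongside mode/dynamic is the whole point of
    # the fix: every user-visible config field reaches the compiler.
    return [
        compile_fn(block, mode=cfg.torch_compile_mode, dynamic=False,
                   fullgraph=cfg.enable_fullgraph)
        for block in blocks
    ]
```

A quick way to catch regressions like this is a unit test that passes a recording stub as `compile_fn` and asserts every config field appears in the received kwargs.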
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
db18678 to a3f6fbe (Compare)
/bot run --disable-fail-fast

PR_Github #36171 [ run ] triggered by Bot. Commit:
PR_Github #36171 [ run ] completed with state

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

/bot run --disable-fail-fast

PR_Github #36193 [ run ] triggered by Bot. Commit:
PR_Github #36193 [ run ] completed with state

/bot run --disable-fail-fast --reuse-test

PR_Github #36276 [ run ] triggered by Bot. Commit:
PR_Github #36276 [ run ] completed with state

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

/bot run --disable-fail-fast --reuse-test

PR_Github #36289 [ run ] triggered by Bot. Commit:
PR_Github #36289 [ run ] completed with state
zhenhuaw-me left a comment

Thanks for creating this foundation support in the main branch. I filed several JIRAs to follow up for the sake of user experience.
```diff
@@ -434,6 +440,7 @@ def test_fp8_vs_bf16_memory_comparison(checkpoint_exists):
      checkpoint_path=CHECKPOINT_PATH,
      quant_config={"quant_algo": "FP8_BLOCK_SCALES", "dynamic": True},
      skip_components=SKIP_HEAVY_COMPONENTS,
+     pipeline={"warmup_steps": 0},
```

Does it make sense to keep one test with the default (or most-used) warmup steps to protect the general use case? (This also applies to other tests.)
```python
# Double warmup steps to also warmup the 2nd transformer
warmup_steps = warmup_steps * 2

...

for height, width, num_frames in self.common_warmup_shapes:
```

Filed https://jirasw.nvidia.com/browse/TRTLLM-11107 as a follow-up.
```python
negative_prompt="",
height=height,
width=width,
num_frames=num_frames,
```

https://jirasw.nvidia.com/browse/TRTLLM-11112 to track the improvements.
```diff
- torch_compile_models: str = PipelineComponent.TRANSFORMER
- torch_compile_mode: str = "default"
+ torch_compile_models: List[str] = []  # empty = auto detect transformer components
+ torch_compile_mode: Literal["default", "max-autotune", "reduce-overhead"] = "default"
```

Filed https://jirasw.nvidia.com/browse/TRTLLM-11115 to follow up.
```python
and (req.height, req.width, req.num_frames) not in self.pipeline.common_warmup_shapes
):
    logger.warning(
        f"Requested shape (height={req.height}, width={req.width}, num_frames={req.num_frames}) "
```

The fix will be https://jirasw.nvidia.com/browse/TRTLLM-11104.
```python
    "--disable_torch_compile", action="store_true", help="Disable TorchCompile acceleration"
)
parser.add_argument(
    "--torch_compile_models",
```
Summary by CodeRabbit
Description
Performance:
WAN 2.1 w/ torch compile + warmup: 30-90% lower latency
WAN 2.2 w/ torch compile + warmup: 15-60% lower latency
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provides a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.