[None][feat] Visual Gen: add cuda graphs; torch compile; nvtx; warmup #11554

Merged
chang-l merged 4 commits into NVIDIA:main from NVShreyas:user/shreyasm/visual-gen-compile
Feb 20, 2026
Conversation


@NVShreyas NVShreyas commented Feb 17, 2026

Summary by CodeRabbit

  • New Features
    • Added TorchCompile support with multiple compilation modes (default, max-autotune, reduce-overhead) and fullgraph option for flexible optimization
    • Introduced CUDA graph optimization to improve inference performance
    • Added configurable warmup steps for model initialization and compilation
    • Integrated layer-wise NVTX markers for detailed performance profiling and analysis
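The new CLI options listed in this PR (e.g. `--torch_compile_mode`, `--enable_fullgraph`, `--warmup_steps`) feed a `pipeline` section of `diffusion_config`. A minimal sketch of that mapping, with key names assumed from the option list rather than taken from the exact schema:

```python
# Hypothetical sketch of how the new CLI flags might map into the
# `pipeline` section of diffusion_config; the key names mirror the
# options listed in this PR but are assumptions, not the exact schema.
def build_pipeline_config(args: dict) -> dict:
    return {
        "enable_torch_compile": not args.get("disable_torch_compile", False),
        "torch_compile_models": args.get("torch_compile_models", ["transformer"]),
        "torch_compile_mode": args.get("torch_compile_mode", "default"),
        "enable_fullgraph": args.get("enable_fullgraph", False),
        "enable_cuda_graph": args.get("enable_cudagraph", False),
        "warmup_steps": args.get("warmup_steps", 0),
        "enable_layerwise_nvtx_marker": args.get("enable_layerwise_nvtx_marker", False),
    }

# Example: reduce-overhead compilation with two warmup steps.
cfg = build_pipeline_config({"torch_compile_mode": "reduce-overhead", "warmup_steps": 2})
```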

Description

Performance:

WAN 2.1 with torch compile + warmup: 30-90% lower latency
WAN 2.2 with torch compile + warmup: 15-60% lower latency

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

This PR extends the VisualGen pipeline with TorchCompile and CUDA graph acceleration capabilities. New CLI options and configuration fields enable these features across example scripts. A CUDA graph runner manages per-key graph lifecycles with warm-up routines. The pipeline orchestrates conditional torch compilation, CUDA graph setup, and NVTX profiling instrumentation.

Changes

Cohort / File(s) Summary
CLI argument expansion
examples/visual_gen/visual_gen_wan_i2v.py, examples/visual_gen/visual_gen_wan_t2v.py
Added TorchCompile, CUDA graph, and performance tuning CLI options: --disable_torch_compile, --torch_compile_models, --torch_compile_mode, --enable_fullgraph, --warmup_steps, --enable_layerwise_nvtx_marker. Extended diffusion_config construction with a new pipeline section propagating these options.
Configuration schema
tensorrt_llm/_torch/visual_gen/config.py
Updated PipelineConfig: torch_compile_models changed from str to List[str]; torch_compile_mode constrained to Literal["default", "max-autotune", "reduce-overhead"]; added boolean flags enable_fullgraph, enable_cuda_graph, enable_layerwise_nvtx_marker and integer warmup_steps.
CUDA graph runner
tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py
New module introducing CUDAGraphRunnerConfig and CUDAGraphRunner class to manage per-key CUDA graph lifecycles with capture, replay, and wrap utilities. Supports configurable memory pools and warm-up iterations.
Attention backend optimization
tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py
Added _prepare_metadata and _concat_qkv helper methods to TrtllmAttention for torch.compile compatibility. Refactored forward path to utilize new methods for QKV fusion and metadata preparation.
Pipeline core infrastructure
tensorrt_llm/_torch/visual_gen/pipeline.py
Added torch_compile public method for selective component compilation, _find_transformer_blocks static helper for block detection, warmup public method and _run_warmup protected method for warm-up routines. Integrated CUDA graph runner initialization and NVTX profiling decorators (@nvtx_range) on _scheduler_step and denoise with per-step annotation.
Pipeline-specific implementation
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py
Added _run_warmup method with variant-aware resolution selection and NVTX range decorators on forward, _encode_prompt, _prepare_latents, and _decode_latents methods for granular profiling.
Pipeline loader orchestration
tensorrt_llm/_torch/visual_gen/pipeline_loader.py
Added post-load logic to conditionally enable torch.compile based on config, invoke pipeline warm-up, and register LayerwiseNvtxMarker hooks on transformer when enabled.
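The per-key graph lifecycle described for cuda_graph_runner.py above (capture once per input-shape key, replay thereafter) can be illustrated with a CUDA-free sketch. The real runner captures `torch.cuda.CUDAGraph` objects; this stand-in only demonstrates the dispatch logic and is not the PR's implementation.

```python
# CUDA-free illustration of the per-key capture/replay dispatch in the new
# CUDAGraphRunner. The real class captures torch.cuda.CUDAGraph objects;
# this stub records how often each key is captured vs. replayed.
class GraphRunnerSketch:
    def __init__(self):
        self.graphs = {}          # key -> captured callable (stand-in for a CUDA graph)
        self.capture_counts = {}  # key -> number of captures (should stay 1 per key)

    @staticmethod
    def get_graph_key(args):
        # Key by tensor-like shapes; the real runner does this for torch.Tensor.
        return tuple(tuple(a.shape) if hasattr(a, "shape") else a for a in args)

    def wrap(self, fn):
        def wrapped(*args):
            key = self.get_graph_key(args)
            if key not in self.graphs:
                # "Capture": run eagerly once and remember this key.
                self.graphs[key] = fn
                self.capture_counts[key] = self.capture_counts.get(key, 0) + 1
            # "Replay": reuse the captured graph for this key.
            return self.graphs[key](*args)
        return wrapped

runner = GraphRunnerSketch()
step = runner.wrap(lambda x: x * 2)
results = [step(3), step(3), step(4)]  # second call with 3 replays, 4 triggers a new capture
```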

Sequence Diagram

sequenceDiagram
    participant User
    participant Loader as pipeline_loader
    participant Pipeline as BasePipeline
    participant Compiler as torch.compile
    participant CUDA as CUDAGraphRunner
    participant Transformer as Transformer
    participant Inference as Forward Pass

    User->>Loader: Load with config
    Loader->>Pipeline: Initialize pipeline
    
    alt enable_torch_compile
        Loader->>Pipeline: torch_compile()
        Pipeline->>Compiler: Compile selected modules
        Compiler-->>Pipeline: Compiled modules
    end
    
    alt enable_cuda_graph
        Loader->>CUDA: Initialize CUDAGraphRunner
        CUDA-->>Pipeline: Graph manager ready
    end
    
    Loader->>Pipeline: warmup(warmup_steps)
    Pipeline->>Transformer: Run warmup inference
    Transformer-->>Pipeline: Warm-up complete
    
    alt enable_layerwise_nvtx_marker
        Loader->>Transformer: Register NVTX hooks
    end
    
    Loader-->>User: Pipeline ready
    
    User->>Pipeline: forward(prompt, ...)
    Pipeline->>Pipeline: denoise loop (with `@nvtx_range`)
    loop each denoise step
        Pipeline->>CUDA: wrap(denoise_step)
        alt graph exists for key
            CUDA->>CUDA: replay graph
        else new key
            CUDA->>Transformer: capture graph
            Transformer-->>CUDA: graph captured
            CUDA->>CUDA: replay graph
        end
        CUDA-->>Pipeline: step output
    end
    Pipeline-->>User: Generate output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.86%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ⚠️ Warning: PR description is largely incomplete and lacks essential details about the changes. Resolution: add a comprehensive description explaining what features are being added (CUDA graphs, torch.compile, NVTX, warmup) and why they are needed, provide detailed test coverage information, and include a properly formatted PR title with ticket/issue reference and type.
✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title clearly summarizes the main changes: adding CUDA graphs, torch compile support, NVTX instrumentation, and warmup functionality to Visual Gen, which aligns with the raw summary content.


coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/visual_gen/pipeline.py (1)

1-13: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

This source file should include the NVIDIA Apache 2.0 copyright header per repository coding guidelines.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 1 - 13, This file is
missing the required NVIDIA Apache-2.0 copyright header; add the standard NVIDIA
header (including the correct year of latest meaningful modification and the
Apache-2.0 license notice) at the top of the module before any imports so it
precedes symbols like imports of torch, Mapping, CUDAGraphRunner, and
classes/functions in this file (e.g., CUDAGraphRunner, CUDAGraphRunnerConfig,
TeaCacheBackend) to comply with repository coding guidelines.
tensorrt_llm/_torch/visual_gen/config.py (1)

195-211: ⚠️ Potential issue | 🟡 Minor

enable_fullgraph is declared but never passed to torch.compile().

The config field enable_fullgraph (line 201) is defined and exposed as a CLI argument, but BasePipeline.torch_compile() in pipeline.py (lines 173–178) calls torch.compile(block, mode=compile_mode, dynamic=False) without passing the fullgraph parameter. The flag is silently ignored.

Fix in `tensorrt_llm/_torch/visual_gen/pipeline.py` `torch_compile()` method
+        fullgraph = pipeline_config.enable_fullgraph
+
         for name in pipeline_config.torch_compile_models:
             ...
                         compiled_blocks.append(
-                            torch.compile(block, mode=compile_mode, dynamic=False)
+                            torch.compile(block, mode=compile_mode, dynamic=False, fullgraph=fullgraph)
                         )
             ...
-                compiled = torch.compile(model, mode=compile_mode, dynamic=False)
+                compiled = torch.compile(model, mode=compile_mode, dynamic=False, fullgraph=fullgraph)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/config.py` around lines 195 - 211, The
PipelineConfig flag enable_fullgraph is never forwarded to torch.compile, so
update BasePipeline.torch_compile() to pass the fullgraph argument from the
config: when calling torch.compile(block, mode=compile_mode, dynamic=False)
include fullgraph=self.config.enable_fullgraph (or equivalent access to
PipelineConfig) so the user-specified flag is respected; ensure the call site in
the BasePipeline.torch_compile method references PipelineConfig.enable_fullgraph
and retains existing mode/dynamic behavior.
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (1)

1-17: ⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

This source file has no Apache 2.0 / NVIDIA copyright header. Per coding guidelines, all .py source files must contain one.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py` around lines 1 -
17, This file is missing the required NVIDIA/Apache-2.0 copyright header; add
the standard NVIDIA copyright and Apache-2.0 license header at the very top of
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (above the imports)
with the year of the latest meaningful modification, matching the project's
header format used in other .py files; ensure the header remains a comment block
and does not alter imports or symbols like AutoencoderKLWan,
FlowMatchEulerDiscreteScheduler, BasePipeline, register_pipeline, etc.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py (2)

129-136: Redundant replay() after capture() on first call.

After capture() runs the function and stores the output, replay() is called immediately with the same inputs — this replays the graph unnecessarily since the output is already stored. You can return self.graph_outputs[key] directly.

Suggested change
             if key not in self.graphs:
                 self.capture(key, fn, args, kwargs)
-                return self.replay(key, args, kwargs)
-            else:
-                return self.replay(key, args, kwargs)
+            return self.replay(key, args, kwargs)

Or skip the replay entirely on first capture:

             if key not in self.graphs:
                 self.capture(key, fn, args, kwargs)
+                return self.graph_outputs[key]
             else:
                 return self.replay(key, args, kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py` around lines 129 - 136,
The current flow in get_graph_key/use-site calls capture() then immediately
replay() which is redundant; after calling self.capture(key, fn, args, kwargs)
return the cached output from self.graph_outputs[key] instead of calling
self.replay again. Update the branch that checks if key not in self.graphs to
call self.capture(...) and then return self.graph_outputs[key], and
simplify/remove the duplicate replay path so only the else path calls
self.replay(key, args, kwargs).

77-82: gc.collect() + torch.cuda.empty_cache() inside the warmup loop is excessive.

These are called every warmup iteration (default 2). Moving them after the loop would reduce overhead while still freeing memory before capture.

Suggested change
         for _ in range(self.WARMUP_STEPS):
             fn(*static_args, **static_kwargs)
             torch.cuda.synchronize()
-            gc.collect()
-            torch.cuda.empty_cache()
+
+        gc.collect()
+        torch.cuda.empty_cache()
 
         with torch.cuda.graph(graph, pool=self.memory_pool):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py` around lines 77 - 82,
The warmup loop currently calls gc.collect() and torch.cuda.empty_cache() every
iteration which is unnecessary; instead, keep the loop that calls
fn(*static_args, **static_kwargs) and torch.cuda.synchronize() for
self.WARMUP_STEPS, then call gc.collect() and torch.cuda.empty_cache() once
after the loop and just before creating/using torch.cuda.CUDAGraph() (or before
capture) to reduce overhead while still freeing memory; locate the warmup loop
that references self.WARMUP_STEPS, fn, and torch.cuda.CUDAGraph() and move the
two cleanup calls out of the loop accordingly.
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (1)

245-278: Warmup implementation runs the full pipeline including text encoding and VAE decode.

This is thorough but note it requires all standard components (tokenizer, text_encoder, VAE, scheduler) to be loaded, which means warmup will fail if any are in skip_components. Consider adding a guard or at least a clear error message if a required component is missing during warmup.

Also, the variant detection via substring match in checkpoint_path (lines 260–264) is fragile — a path containing "480P" in a parent directory name could trigger an unintended match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py` around lines 245 -
278, The _run_warmup method currently calls the full forward pipeline which will
fail if components are missing; before running warmup verify required components
(tokenizer, text_encoder, VAE/decoder, scheduler or whatever forward() needs)
are present on self (or not listed in self.model_config.skip_components) and
either skip warmup with a clear logger error/warning or raise a
descriptive exception; also tighten the variant detection by checking the
checkpoint filename or using a regex/boundary match instead of a plain substring
on checkpoint_path (refer to checkpoint_path, variant_shapes and the loop that
sets height/width/num_frames in _run_warmup) so parent directories containing
“480P” don’t produce false matches.
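The fragile-substring concern above can be addressed with a boundary-aware match restricted to the checkpoint's basename. A minimal sketch, assuming the "480P"/"720P" variant tokens from the comment; `detect_variant` itself is a hypothetical helper, not the PR's code:

```python
import os
import re
from typing import Optional

# Hypothetical helper for the variant-detection concern: match the variant
# token only in the checkpoint's basename, with alphanumeric boundaries, so
# a parent directory such as "480P-runs" cannot trigger a false match.
def detect_variant(checkpoint_path: str, variants=("480P", "720P")) -> Optional[str]:
    basename = os.path.basename(os.path.normpath(checkpoint_path))
    for variant in variants:
        # Lookarounds reject matches embedded in longer alphanumeric runs
        # (e.g. "1480P"), while allowing separators like "-" or "_".
        if re.search(rf"(?<![0-9A-Za-z]){re.escape(variant)}(?![0-9A-Za-z])", basename):
            return variant
    return None

print(detect_variant("/ckpts/Wan2.1-I2V-14B-480P"))       # token in basename -> "480P"
print(detect_variant("/ckpts/480P-runs/Wan2.1-I2V-14B"))  # only in parent dir -> None
```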
tensorrt_llm/_torch/visual_gen/pipeline.py (2)

40-57: CUDA graph + torch.compile mutual exclusion is handled correctly, but the early return could be clearer.

The return at line 49 exits __init__ entirely. Currently this only skips the CUDA graph runner setup (lines 51–57), which is the intended behavior. However, if future code is added after this block in __init__, it will be silently skipped. Consider restructuring to use elif or an early guard that doesn't exit the constructor.

Suggested restructuring
         if self.model_config.pipeline.enable_cuda_graph and self.transformer is not None:
             if self.model_config.pipeline.enable_torch_compile:
                 logger.warning(
                     "Cuda graphs with torch compile is not supported yet. "
                     "Only using torch compile for better performance. "
                 )
-                return
-
-            self.cuda_graph_runner = CUDAGraphRunner(
-                CUDAGraphRunnerConfig(
-                    use_cuda_graph=model_config.pipeline.enable_cuda_graph,
+            else:
+                self.cuda_graph_runner = CUDAGraphRunner(
+                    CUDAGraphRunnerConfig(
+                        use_cuda_graph=model_config.pipeline.enable_cuda_graph,
+                    )
                 )
-            )
-            logger.info("Cuda graph runner enabled, wrapping transformer.forward")
-            self.transformer.forward = self.cuda_graph_runner.wrap(self.transformer.forward)
+                logger.info("Cuda graph runner enabled, wrapping transformer.forward")
+                self.transformer.forward = self.cuda_graph_runner.wrap(self.transformer.forward)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 40 - 57, The early
"return" inside the CUDA-graph block exits __init__ entirely and may
inadvertently skip later initialization; instead remove the return and
restructure the condition so torch.compile branches don't exit the
constructor—e.g., change the nested if to an if/elif or add an else before
creating CUDAGraphRunner so when self.model_config.pipeline.enable_cuda_graph is
true but enable_torch_compile is also true you log the warning but continue
__init__; update the block around model_config.pipeline.enable_cuda_graph,
enable_torch_compile, CUDAGraphRunner, self.cuda_graph_runner, and
self.transformer.forward = self.cuda_graph_runner.wrap(...) accordingly.

185-195: _find_transformer_blocks only matches ModuleList with len > 1.

A model with a single transformer block (e.g., a tiny debug model) would not match, falling through to whole-module compilation. This is probably fine in practice, but worth documenting the threshold choice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/pipeline.py` around lines 185 - 195, The
helper _find_transformer_blocks currently only recognizes nn.ModuleList children
with len > 1, which skips models that contain a single transformer block; update
the predicate in _find_transformer_blocks (the loop over model.named_children
and the isinstance(child, nn.ModuleList) check) to accept ModuleList instances
with len >= 1 (or len(child) > 0) so single-block ModuleLists are returned, and
update the method docstring to state it returns attribute names containing
nn.ModuleList with one or more elements to document the threshold change.
tensorrt_llm/_torch/visual_gen/pipeline_loader.py (1)

218-223: Import style: prefer importing the module, not the class directly.

Per coding guidelines, use from tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker rather than importing LayerwiseNvtxMarker directly.

Suggested change
         if config.pipeline.enable_layerwise_nvtx_marker:
-            from tensorrt_llm._torch.pyexecutor.layerwise_nvtx_marker import LayerwiseNvtxMarker
+            from tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker
 
-            marker = LayerwiseNvtxMarker()
+            marker = layerwise_nvtx_marker.LayerwiseNvtxMarker()
             module_prefix = pipeline.__class__.__name__
             marker.register_hooks(pipeline.transformer, module_prefix)

As per coding guidelines: "Python imports must use from package.subpackage import module style; never use from module import Class."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/pipeline_loader.py` around lines 218 - 223,
Replace the direct class import with a module-level import and instantiate the
class via the module: change the import to use "from
tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker", then create the
marker with "layerwise_nvtx_marker.LayerwiseNvtxMarker()" and keep the
subsequent calls (marker.register_hooks(pipeline.transformer, module_prefix))
and the surrounding conditional (config.pipeline.enable_layerwise_nvtx_marker)
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/visual_gen/visual_gen_wan_i2v.py`:
- Around line 199-207: The code references args.enable_cudagraph in the pipeline
config but parse_args() never defines that CLI flag, causing AttributeError; add
a new argument definition in parse_args() (similar to visual_gen_wan_t2v.py)
such as adding parser.add_argument("--enable_cudagraph", action="store_true",
default=False, help="enable CUDA graph usage") (place it near the other pipeline
flags, e.g., before the --disable_torch_compile block) so args.enable_cudagraph
is available when building the "pipeline" dict.

In `@tensorrt_llm/_torch/visual_gen/attention_backend/trtllm.py`:
- Around line 194-209: The method _concat_qkv is unconditionally decorated with
`@torch.compile` which forces compilation regardless of pipeline config; change it
to follow the project's pattern and respect the config by making the decorator
conditional (e.g., use `@torch.compile(disable=not _is_torch_compile())` or
`@torch.compile(disable=not enable_torch_compile)`) or remove the decorator
entirely and rely on BasePipeline.torch_compile() block-level compilation;
update the decorator on _concat_qkv accordingly so it no longer bypasses the
config-controlled torch.compile behavior.

In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py`:
- Around line 55-64: get_graph_key currently only uses tensor shapes so
non-tensor arguments (ints, bools, strings, enums, etc.) are ignored and
different values will reuse the wrong CUDA graph; update get_graph_key to
include non-tensor args as part of the returned key by extending the existing
tuple with a stable, hashable representation of each non-tensor positional arg
and each non-tensor kwarg (preserve kwarg key order by sorting keys, e.g.,
include (k, value_repr) for kwargs); keep using tuple(shape) for torch.Tensor
inputs but for non-tensors use a deterministic representation such as
(type_name, repr(value)) or a safe serialization (pickle or json for primitives)
to avoid unhashable objects, then return the combined tuple as the KeyType so
graphs are keyed by both tensor shapes and non-tensor argument values.
- Around line 1-11: Add the required NVIDIA Apache-2.0 copyright header to the
top of the new module tensorrt_llm._torch.visual_gen.cuda_graph_runner (i.e.,
prepend the standard NVIDIA file header with copyright year and license text
before any imports), ensuring the year is correct and the header matches other
repository files.

In `@tensorrt_llm/_torch/visual_gen/pipeline.py`:
- Around line 146-183: The torch_compile method ignores
pipeline_config.enable_fullgraph; update both places that call torch.compile in
torch_compile (the per-block compile inside the loop that builds compiled_blocks
and the whole-module compile path that sets compiled) to pass
fullgraph=pipeline_config.enable_fullgraph in addition to mode=compile_mode and
dynamic=False so the config option is respected; reference the torch_compile
method, pipeline_config.torch_compile_mode, pipeline_config.enable_fullgraph,
and the per-block loop that constructs compiled_blocks and the else branch that
sets self.<name>=compiled to locate the two call sites.
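The get_graph_key fix suggested above (fold non-tensor arguments into the cache key so graphs are not wrongly reused across, say, use_cache=True vs. False) can be sketched CUDA-free. `make_graph_key` is a hypothetical stand-in: it keys tensor-likes by shape and everything else by a stable (type, repr) pair.

```python
# Sketch of the suggested get_graph_key fix: key tensor-like args by shape
# and non-tensor args by a stable (type, repr) pair, with kwargs sorted so
# the key is deterministic. Hypothetical helper, not the PR's exact code.
def make_graph_key(args, kwargs):
    def part(v):
        if hasattr(v, "shape"):               # torch.Tensor in the real runner
            return ("tensor", tuple(v.shape))
        return (type(v).__name__, repr(v))    # ints, bools, strings, enums, ...
    return (
        tuple(part(a) for a in args),
        tuple((k, part(kwargs[k])) for k in sorted(kwargs)),
    )

# Different non-tensor values now produce different keys, so a graph
# captured with use_cache=True is never replayed for use_cache=False.
k1 = make_graph_key((1.0,), {"use_cache": True})
k2 = make_graph_key((1.0,), {"use_cache": False})
```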

---

Outside diff comments:
In `@tensorrt_llm/_torch/visual_gen/config.py`:
- Around line 195-211: The PipelineConfig flag enable_fullgraph is never
forwarded to torch.compile, so update BasePipeline.torch_compile() to pass the
fullgraph argument from the config: when calling torch.compile(block,
mode=compile_mode, dynamic=False) include fullgraph=self.config.enable_fullgraph
(or equivalent access to PipelineConfig) so the user-specified flag is
respected; ensure the call site in the BasePipeline.torch_compile method
references PipelineConfig.enable_fullgraph and retains existing mode/dynamic
behavior.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py`:
- Around line 1-17: This file is missing the required NVIDIA/Apache-2.0
copyright header; add the standard NVIDIA copyright and Apache-2.0 license
header at the very top of
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py (above the imports)
with the year of the latest meaningful modification, matching the project's
header format used in other .py files; ensure the header remains a comment block
and does not alter imports or symbols like AutoencoderKLWan,
FlowMatchEulerDiscreteScheduler, BasePipeline, register_pipeline, etc.

In `@tensorrt_llm/_torch/visual_gen/pipeline.py`:
- Around line 1-13: This file is missing the required NVIDIA Apache-2.0
copyright header; add the standard NVIDIA header (including the correct year of
latest meaningful modification and the Apache-2.0 license notice) at the top of
the module before any imports so it precedes symbols like imports of torch,
Mapping, CUDAGraphRunner, and classes/functions in this file (e.g.,
CUDAGraphRunner, CUDAGraphRunnerConfig, TeaCacheBackend) to comply with
repository coding guidelines.

---

Nitpick comments:
In `@tensorrt_llm/_torch/visual_gen/cuda_graph_runner.py`:
- Around line 129-136: The current flow in get_graph_key/use-site calls
capture() then immediately replay() which is redundant; after calling
self.capture(key, fn, args, kwargs) return the cached output from
self.graph_outputs[key] instead of calling self.replay again. Update the branch
that checks if key not in self.graphs to call self.capture(...) and then return
self.graph_outputs[key], and simplify/remove the duplicate replay path so only
the else path calls self.replay(key, args, kwargs).
- Around line 77-82: The warmup loop currently calls gc.collect() and
torch.cuda.empty_cache() every iteration which is unnecessary; instead, keep the
loop that calls fn(*static_args, **static_kwargs) and torch.cuda.synchronize()
for self.WARMUP_STEPS, then call gc.collect() and torch.cuda.empty_cache() once
after the loop and just before creating/using torch.cuda.CUDAGraph() (or before
capture) to reduce overhead while still freeing memory; locate the warmup loop
that references self.WARMUP_STEPS, fn, and torch.cuda.CUDAGraph() and move the
two cleanup calls out of the loop accordingly.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py`:
- Around lines 245-278: _run_warmup calls the full forward pipeline, which
will fail if components are missing. Before running warmup, verify that the
components forward() needs (tokenizer, text_encoder, VAE/decoder, scheduler)
are present on self and not listed in self.model_config.skip_components, and
either skip warmup with a clear logger error/warning or raise a descriptive
exception. Also tighten the variant detection: check the checkpoint filename,
or use a regex/boundary match instead of a plain substring test on
checkpoint_path (see checkpoint_path, variant_shapes, and the loop that sets
height/width/num_frames in _run_warmup), so parent directories containing
"480P" don't produce false matches.
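One way the tightened detection could look (a sketch only; the variant tokens and helper name are illustrative, not the pipeline's actual code):

```python
import re
from pathlib import Path

def detect_variant(checkpoint_path):
    """Match the variant token with alphanumeric boundaries against the
    checkpoint's filename only, so a parent directory like
    '.../480P-cache/' cannot produce a false match."""
    name = Path(checkpoint_path).name
    for variant in ("480P", "720P"):  # illustrative variant names
        if re.search(rf"(?<![0-9A-Za-z]){re.escape(variant)}(?![0-9A-Za-z])",
                     name):
            return variant
    return None
```

Restricting the search to `Path(checkpoint_path).name` is the key part: the directory portion of the path never participates in the match.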

In `@tensorrt_llm/_torch/visual_gen/pipeline_loader.py`:
- Around lines 218-223: replace the direct class import with a module-level
import and instantiate the class via the module: change the import to
`from tensorrt_llm._torch.pyexecutor import layerwise_nvtx_marker`, then
create the marker with `layerwise_nvtx_marker.LayerwiseNvtxMarker()`. Keep the
subsequent call marker.register_hooks(pipeline.transformer, module_prefix) and
the surrounding conditional on config.pipeline.enable_layerwise_nvtx_marker
unchanged.
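A small self-contained demonstration of why this import style matters (the module here is a stub built with `types.ModuleType`, not the real tensorrt_llm package): resolving the class through the module at call time picks up later monkeypatches, whereas `from mod import Class` binds the original object permanently.

```python
import types

# Stub standing in for tensorrt_llm._torch.pyexecutor.layerwise_nvtx_marker.
layerwise_nvtx_marker = types.ModuleType("layerwise_nvtx_marker")

class _RealMarker:
    def register_hooks(self, model, prefix):
        return ("real", prefix)

layerwise_nvtx_marker.LayerwiseNvtxMarker = _RealMarker

def install_marker():
    # Module-level resolution, as the review suggests: the attribute is
    # looked up each call, so tests can swap in a fake marker class.
    marker = layerwise_nvtx_marker.LayerwiseNvtxMarker()
    return marker.register_hooks(model=None, prefix="transformer")
```

With a direct class import, patching `layerwise_nvtx_marker.LayerwiseNvtxMarker` afterwards would have no effect on `install_marker`.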

In `@tensorrt_llm/_torch/visual_gen/pipeline.py`:
- Around lines 40-57: the early return inside the CUDA-graph block exits
__init__ entirely and may inadvertently skip later initialization. Remove the
return and restructure the condition (if/elif, or an else before creating
CUDAGraphRunner) so the torch.compile branch doesn't exit the constructor:
when self.model_config.pipeline.enable_cuda_graph is true but
enable_torch_compile is also true, log the warning but continue __init__.
Update the block around enable_cuda_graph, enable_torch_compile,
CUDAGraphRunner, self.cuda_graph_runner, and self.transformer.forward =
self.cuda_graph_runner.wrap(...) accordingly.
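A hypothetical sketch of the restructured constructor (flag names mirror the review comment; a plain `object()` stands in for CUDAGraphRunner so the control flow is runnable):

```python
import logging

logger = logging.getLogger(__name__)

class PipelineSketch:
    """Illustrative only: shows the if/elif shape that avoids the
    early return, not the real pipeline's __init__."""

    def __init__(self, enable_cuda_graph, enable_torch_compile):
        self.cuda_graph_runner = None
        if enable_cuda_graph and enable_torch_compile:
            # Warn and fall through; do NOT return out of __init__.
            logger.warning(
                "CUDA graphs disabled because torch.compile is enabled.")
        elif enable_cuda_graph:
            self.cuda_graph_runner = object()  # stand-in for CUDAGraphRunner(...)
            # self.transformer.forward = self.cuda_graph_runner.wrap(...)
        # Later initialization now runs in every branch.
        self.initialized = True
```

The point is that `self.initialized` (standing in for whatever setup follows the CUDA-graph block) is reached on all three paths.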
- Around lines 185-195: _find_transformer_blocks only recognizes
nn.ModuleList children with len > 1, which skips models containing a single
transformer block. Update the predicate in the loop over
model.named_children() (the isinstance(child, nn.ModuleList) check) to accept
ModuleList instances with len >= 1 (i.e., len(child) > 0), and update the
docstring to state that the method returns attribute names of nn.ModuleList
children with one or more elements.
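A torch-free sketch of the corrected predicate (a list subclass stands in for nn.ModuleList, and the function takes (name, child) pairs mirroring nn.Module.named_children(); names are illustrative):

```python
class ModuleList(list):
    """Stand-in for torch.nn.ModuleList so the sketch runs without torch."""

def find_transformer_blocks(named_children):
    """Return the attribute names of ModuleList children with one or
    more elements, so a single-block ModuleList is no longer skipped."""
    return [
        name
        for name, child in named_children
        if isinstance(child, ModuleList) and len(child) >= 1
    ]
```

Under the old `len > 1` check, the single-element `ModuleList` below would have been silently dropped.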

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas NVShreyas force-pushed the user/shreyasm/visual-gen-compile branch from db18678 to a3f6fbe Compare February 18, 2026 16:57

@chang-l chang-l left a comment


LGTM

@NVShreyas

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #36171 [ run ] triggered by Bot. Commit: a3f6fbe Link to invocation

@tensorrt-cicd

PR_Github #36171 [ run ] completed with state SUCCESS. Commit: a3f6fbe
/LLM/main/L0_MergeRequest_PR pipeline #27954 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #36193 [ run ] triggered by Bot. Commit: aa023da Link to invocation

@tensorrt-cicd

PR_Github #36193 [ run ] completed with state SUCCESS. Commit: aa023da
/LLM/main/L0_MergeRequest_PR pipeline #27973 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@NVShreyas

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd

PR_Github #36276 [ run ] triggered by Bot. Commit: aa023da Link to invocation

@tensorrt-cicd

PR_Github #36276 [ run ] completed with state SUCCESS. Commit: aa023da
/LLM/main/L0_MergeRequest_PR pipeline #28050 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
@NVShreyas

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd

PR_Github #36289 [ run ] triggered by Bot. Commit: 9130adf Link to invocation

@tensorrt-cicd

PR_Github #36289 [ run ] completed with state SUCCESS. Commit: 9130adf
/LLM/main/L0_MergeRequest_PR pipeline #28063 completed with status: 'SUCCESS'

Link to invocation

@chang-l chang-l merged commit 4bee075 into NVIDIA:main Feb 20, 2026
5 checks passed

@zhenhuaw-me zhenhuaw-me left a comment


Thanks for creating this foundation support in the main branch. I filed several JIRAs to follow up for the sake of user experience.

@@ -434,6 +440,7 @@ def test_fp8_vs_bf16_memory_comparison(checkpoint_exists):
checkpoint_path=CHECKPOINT_PATH,
quant_config={"quant_algo": "FP8_BLOCK_SCALES", "dynamic": True},
skip_components=SKIP_HEAVY_COMPONENTS,
pipeline={"warmup_steps": 0},

Does it make sense to keep 1 test with the "default (or mostly used) warmup steps" to protect the general use case? (also apply to other tests)

# Double warmup steps to also warmup the 2nd transformer
warmup_steps = warmup_steps * 2

for height, width, num_frames in self.common_warmup_shapes:

negative_prompt="",
height=height,
width=width,
num_frames=num_frames,

torch_compile_models: str = PipelineComponent.TRANSFORMER
torch_compile_mode: str = "default"
torch_compile_models: List[str] = [] # empty = auto detect transformer components
torch_compile_mode: Literal["default", "max-autotune", "reduce-overhead"] = "default"

and (req.height, req.width, req.num_frames) not in self.pipeline.common_warmup_shapes
):
logger.warning(
f"Requested shape (height={req.height}, width={req.width}, num_frames={req.num_frames}) "

"--disable_torch_compile", action="store_true", help="Disable TorchCompile acceleration"
)
parser.add_argument(
"--torch_compile_models",

TRTLLM-11116 to follow up
