[None][infra] AutoDeploy: Dump graph IR after every transform #11045
bmarimuthu-nv merged 3 commits into NVIDIA:main
Conversation
Looks good to me!
galagam left a comment:
Looks great, tried it out today on my debug :)
lucaslie left a comment:
Feel free to merge it after addressing my comments.
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
📝 Walkthrough
Adds graph debugging infrastructure to the TensorRT LLM auto-deploy pipeline. Node renaming in the export flow encodes the module hierarchy into graph node names, while a new GraphWriter utility provides SSA-style graph serialization with shape and dtype metadata. Transform pipeline hooks dump the resulting graphs automatically when enabled via an environment variable.
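To make the node-renaming scheme concrete, here is a minimal sketch of how a module-hierarchy path could be folded into an FX-legal node name. This is an illustration only: `encode_node_name` is a hypothetical helper, not the PR's actual function; the assumption is that renaming sanitizes the dotted module path and appends the op name plus a uniquifying index, matching the example name shown in the PR description.

```python
import re

def encode_node_name(module_path: str, op_name: str, index: int) -> str:
    """Hypothetical sketch: fold a module path and op name into one node name,
    e.g. 'model.layers.0.self_attn.kv_b_proj' + 'torch_linear_simple' + 3
    -> 'model_layers_0_self_attn_kv_b_proj_torch_linear_simple_3'."""
    # replace every character that is not valid in an identifier with '_'
    sanitized = re.sub(r"[^0-9a-zA-Z_]", "_", module_path)
    return f"{sanitized}_{op_name}_{index}"

print(encode_node_name("model.layers.0.self_attn.kv_b_proj", "torch_linear_simple", 3))
# → model_layers_0_self_attn_kv_b_proj_torch_linear_simple_3
```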
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py`:
- Around lines 1-11: Add the required NVIDIA copyright header (with the current year) to the top of the new module `graph_writer.py`. Insert the standard multi-line copyright header comment before the imports so that all source files include the NVIDIA copyright and the year of latest modification, preserving the existing imports and encoding.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)

491-492: Consider wrapping `dump_graph` in a try-except to prevent debug infrastructure from breaking the pipeline.

If `AD_DUMP_GRAPHS_DIR` is set and a file-system error occurs (e.g., permission denied, disk full), this will propagate an unhandled exception and abort the transform pipeline. A debug/diagnostic feature should not crash the main flow.

Proposed safeguard:

```diff
 # Dump graph after transform for debugging (controlled by AD_DUMP_GRAPHS_DIR env var)
-graph_writer.dump_graph(mod, t_name, self.config.stage.value)
+try:
+    graph_writer.dump_graph(mod, t_name, self.config.stage.value)
+except Exception as e:
+    ad_logger.warning(f"Failed to dump graph for {t_name}: {e}")
```

tensorrt_llm/_torch/auto_deploy/export/export.py (1)

218-225: Op-name extraction fallback could produce noisy names.

The `str(target).split(".")[-1]` fallback (line 225) for targets without `__name__` or `_name` may yield unexpected fragments for complex target representations (e.g., `OpOverloadPacket` objects). This is a minor concern since most `call_function` targets will have `__name__`, but worth a note.

tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py (4)
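The fallback chain the reviewer describes can be sketched in isolation. This is a minimal illustration under the stated assumptions (prefer `__name__`, then `_name`, then the last dotted component of `str(target)`); `extract_op_name` and `FakeOverload` are hypothetical names, not the PR's actual code.

```python
def extract_op_name(target) -> str:
    """Sketch of the fallback chain: __name__, then _name,
    then the last dotted component of str(target)."""
    name = getattr(target, "__name__", None)
    if name is None:
        name = getattr(target, "_name", None)
    if name is None:
        name = str(target).split(".")[-1]
    return name

class FakeOverload:
    # mimics an op-overload-like object that exposes only _name
    _name = "aten::add"
    def __repr__(self):
        return "aten.add.Tensor"

print(extract_op_name(len))            # built-in function: uses __name__
print(extract_op_name(FakeOverload()))  # falls back to _name
```

Note how the final `str(target)` branch is where noisy fragments can appear when the repr is not a clean dotted path.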
38-91: Missing docstring for `dump_ssa_with_meta`, and `node.kwargs` are not emitted.

`dump_ssa_with_meta` is a public function (no leading underscore) but lacks a docstring. Per guidelines, prefer docstrings for interfaces that may be used outside a file.

The SSA dump iterates only over `node.args` (line 50) but ignores `node.kwargs`, which can carry important information like `dtype`, `device`, `memory_format`, etc. This reduces the debugging value of the IR output.

`get_attr` nodes are silently skipped, so parameter/buffer references will appear as undefined names in the IR. A brief comment or a placeholder line would improve readability of dumped files.

Suggested kwargs handling (illustrative):

```diff
 for arg in node.args:
     if hasattr(arg, "name"):
         # Look up the arg node's metadata for shape/dtype
         if hasattr(arg, "meta") and "val" in arg.meta:
             arg_shape_dtype = _get_shape_dtype_str(arg.meta["val"])
             input_vars.append(f"%{arg.name} : {arg_shape_dtype}")
         else:
             input_vars.append(f"%{arg.name} : ? : unknown")
     else:
         input_vars.append(str(arg))
+
+# Include keyword arguments
+for key, val in node.kwargs.items():
+    if hasattr(val, "name"):
+        input_vars.append(f"{key}=%{val.name}")
+    else:
+        input_vars.append(f"{key}={val}")
```
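To show what a kwargs-aware SSA line might look like end to end, here is a self-contained sketch with stand-in node objects rather than the real `torch.fx` API; `FakeNode` and `format_ssa_line` are hypothetical, and the exact line layout is an assumption modeled on the `%name : shape : dtype` style described in this review.

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    """Stand-in for a torch.fx Node: just enough fields for formatting."""
    name: str
    op_name: str
    shape_dtype: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

def format_ssa_line(node: FakeNode) -> str:
    """Render one SSA-style line, emitting kwargs as key=%name or key=value."""
    parts = []
    for arg in node.args:
        if isinstance(arg, FakeNode):
            parts.append(f"%{arg.name} : {arg.shape_dtype}")
        else:
            parts.append(str(arg))
    for key, val in node.kwargs.items():
        if isinstance(val, FakeNode):
            parts.append(f"{key}=%{val.name}")
        else:
            parts.append(f"{key}={val}")
    return f"%{node.name} : {node.shape_dtype} = {node.op_name}({', '.join(parts)})"

x = FakeNode("x", "placeholder", "[2, 4] : float16")
out = FakeNode("linear_1", "linear", "[2, 8] : float16",
               args=(x,), kwargs={"bias": None})
print(format_ssa_line(out))
# → %linear_1 : [2, 8] : float16 = linear(%x : [2, 4] : float16, bias=None)
```

Without the kwargs loop, `bias=None` would simply vanish from the dumped IR, which is exactly the information loss the comment warns about.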
112-119: `shutil.rmtree` on an existing directory is destructive; consider adding a safety check.

If a user accidentally sets `AD_DUMP_GRAPHS_DIR` to an important path, line 116 will recursively delete its entire contents. Consider either:

- Only deleting files matching the expected `*.txt` pattern instead of `rmtree`.
- Adding a sentinel file (e.g., `.ad_graph_dump`) on first creation, and only clearing the directory if the sentinel exists.

Sentinel-based approach:

```diff
 if not self._dump_dir_initialized:
     dump_dir_path = Path(self._dump_dir)
+    sentinel = dump_dir_path / ".ad_graph_dump"
     if dump_dir_path.exists():
-        shutil.rmtree(dump_dir_path)
-    dump_dir_path.mkdir(parents=True, exist_ok=True)
+        if sentinel.exists():
+            shutil.rmtree(dump_dir_path)
+        else:
+            self._logger.warning(
+                f"Directory {self._dump_dir} exists but was not created by "
+                "GraphWriter. Skipping cleanup."
+            )
+    dump_dir_path.mkdir(parents=True, exist_ok=True)
+    sentinel.touch()
```
22-28: `_get_shape_str` may raise on non-concrete symbolic dims.

Line 26: `int(d)` is called when `str(d).isdigit()` is true, which is safe for regular integers. However, if a symbolic dimension's string representation happens to be numeric (e.g., a guarded SymInt), calling `int(d)` would be redundant: `str(d)` already gives the digit string. You can simplify this to just `str(d)` for all cases:

```diff
-dims = [str(int(d)) if str(d).isdigit() else str(d) for d in val.shape]
+dims = [str(d) for d in val.shape]
```
94-146: Singleton instantiated at module import time reads the env var eagerly.

`graph_writer = GraphWriter()` on line 146 reads `AD_DUMP_GRAPHS_DIR` once during import. If the env var is set after the module is imported (e.g., in test fixtures), the writer won't pick it up. This is acceptable for CLI usage but worth documenting.

Also, the `__init__` signature accepts no parameters, so there's no way to override the dump directory programmatically (e.g., in tests). Consider exposing a `configure(dump_dir: str)` method or reading the env var lazily in `dump_graph`.
…#11045)
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Ahmet Inci <ainci@nvidia.com>
Summary by CodeRabbit
Description
Dumped graph IRs make debugging easier, and they also help the Cursor agent analyze the model faster.
This PR renames graph nodes to encode the module hierarchy, e.g. `model_layers_0_self_attn_kv_b_proj_torch_linear_simple_3`.
Usage:
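The usage details were not captured in this scrape; a plausible invocation, assuming the `AD_DUMP_GRAPHS_DIR` environment variable named in the walkthrough (the directory path here is illustrative, and the entry point is left out because it was elided):

```shell
# Hypothetical usage sketch: point AD_DUMP_GRAPHS_DIR at a writable directory
# before running auto-deploy; one *.txt IR file is dumped per transform.
export AD_DUMP_GRAPHS_DIR=/tmp/ad_graph_dumps_demo
mkdir -p "$AD_DUMP_GRAPHS_DIR"
echo "dump dir ready: $AD_DUMP_GRAPHS_DIR"
```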
Test Coverage
Tested locally
Example dumped folder:
Example output in IR:
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.