
[Bug]: AutoDeploy Fix the NemotronMOE MMLU OOM issue #10580

@nvchenghaoz

Description

System Info

H100 CW

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pytest tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronMOE -s -vv

Expected behavior

The test passes.

Actual behavior

FAILED tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronMOE::test_bf16 - tensorrt_llm.executor.utils.RequestError: CUDA out of memory. Tried to allocate 3.99 GiB. GPU 0 has a total capacity of 79.11 GiB of which 3.16 GiB is free. Process 328448 has 592.00 MiB memory in use. Including non-PyTorch memory, this process has 75.36 GiB memory in use. Of the allocated memory 67.16 GiB is allocated by PyTorch, with 121.83 MiB allocated in private pools (e.g., CUDA Graphs), and 6.90 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
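The numbers in the allocator message add up to a classic fragmentation failure: the 6.90 GiB of reserved-but-unallocated memory would cover the 3.99 GiB request in aggregate, but only 3.16 GiB is free on the device for a new contiguous segment. A minimal sketch of that arithmetic, with all figures copied from the log above (the `fits` helper is a simplification for illustration, not the real allocator logic):

```python
# Figures taken directly from the CUDA OOM message in this report (GiB).
free_on_device = 3.16        # free device memory at failure time
requested = 3.99             # size of the failing allocation
allocated_by_torch = 67.16   # memory actually allocated by PyTorch
reserved_unallocated = 6.90  # reserved by PyTorch but unallocated

def fits(free_gib: float, request_gib: float) -> bool:
    """Simplified check: a new allocator segment needs at least this
    much free device memory (hypothetical helper, for illustration)."""
    return request_gib <= free_gib

# The request exceeds free device memory, so a fresh segment cannot be
# carved out -- this is exactly the failure in the log.
assert not fits(free_on_device, requested)

# Yet the reserved-but-unallocated pool alone would cover the request;
# it is just fragmented across segments. This is the symptom that
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True targets.
assert reserved_unallocated >= requested

# PyTorch's total reserved pool = allocated + reserved-but-unallocated.
reserved_total = allocated_by_torch + reserved_unallocated
print(f"reserved by PyTorch: {reserved_total:.2f} GiB")  # 74.06 GiB
```

The gap between the 74.06 GiB reserved pool and the reported 75.36 GiB process footprint is non-PyTorch memory (CUDA context, cuBLAS workspaces, etc.), consistent with the log's "Including non-PyTorch memory" note.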

Additional notes

N/A
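The allocator's own message suggests a fragmentation mitigation. A sketch of re-running the repro with it, assuming the same working directory as the Reproduction section (this reduces fragmentation only; it does not shrink the model's true footprint, so the underlying fix may still require lowering memory use in the test itself):

```shell
# Enable expandable segments to reduce allocator fragmentation,
# as suggested by the CUDA OOM message, then re-run the failing test.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
pytest tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronMOE -s -vv
```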

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

Labels

  • AutoDeploy: <NV> AutoDeploy Backend
  • Memory: Memory utilization in TRTLLM: leak/OOM handling, footprint optimization, memory profiling.
  • bug: Something isn't working

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
