feat: add TensorRT-LLM as backend #392
Merged
InftyAI-Agent merged 11 commits into InftyAI:main on May 6, 2025
Conversation
Contributor
Author
/kind feature
Member
/lgtm
What this PR does / why we need it
Add TensorRT-LLM as a backend. Below is a sample inference request and response from a TensorRT-LLM-backed deployment.
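For context, the pod targeted below comes from a Playground wired to the new backend. A minimal sketch of such a manifest, assuming the same wiring as the existing backends (the apiVersion, the `backendName` value, and the exact field names are assumptions and should be checked against the repo):

```yaml
# Hypothetical sketch of a Playground using the new TensorRT-LLM backend;
# field names follow the pattern of the existing backend examples and may
# differ from the final API.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0.5b
  backendRuntimeConfig:
    backendName: tensorrt-llm
```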
Send an inference request.
```bash
kubectl port-forward qwen2-0--5b-0 8080:8080

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "Qwen/Qwen2-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'
```

Response:

```json
{
  "id": "chatcmpl-ecb2f4252cc04f7d9a6842de079487a3",
  "object": "chat.completion",
  "created": 1746111073,
  "model": "models--Qwen--Qwen2-0.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am an artificial intelligence designed to assist with a variety of tasks, including answering",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 39,
    "completion_tokens": 16
  }
}
```

Which issue(s) this PR fixes
Fixes #205
Special notes for your reviewer
In this PR, I didn't add a preStop hook to TensorRT-LLM for graceful termination. The reason is as follows:
Currently, the latest Triton Inference Server image that supports TensorRT-LLM is nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3, which ships TensorRT-LLM 0.18.0. However, the metrics endpoint is only supported starting from TensorRT-LLM 0.19.0, which is still a release candidate. Once a new image ships with metrics support, we can add the preStop hook (a rough sketch of what it could look like follows).
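A minimal sketch of such a hook, once metrics are available: it drains in-flight requests before the container terminates. This assumes Triton's default Prometheus metrics port (8002) and the nv_inference_pending_request_count metric; the port, metric name, and timeout are assumptions, not the final implementation:

```yaml
# Hypothetical sketch only: waits (up to ~60s) until the server reports
# no pending requests, giving the endpoints controller time to stop
# routing new traffic to this pod before it shuts down.
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          for i in $(seq 1 30); do
            # Sum pending-request counts across all models; prints 0 if
            # the metrics endpoint is unreachable or reports nothing.
            inflight=$(curl -sf http://localhost:8002/metrics \
              | awk '/^nv_inference_pending_request_count/ {sum += $2} END {print sum+0}')
            [ "${inflight:-0}" -eq 0 ] && exit 0
            sleep 2
          done
```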
Does this PR introduce a user-facing change?