TensorRT-LLM is a high-performance inference library for Large Language Models (LLMs) on NVIDIA GPUs. It provides Python APIs, command-line tools, and serving infrastructure to optimize and deploy LLMs with state-of-the-art performance through model quantization, distributed execution strategies, and advanced inference techniques.
This document provides a high-level overview of TensorRT-LLM's architecture, capabilities, and components. Detailed information about specific subsystems is covered in their dedicated pages.
Sources: README.md:1-220 tensorrt_llm/version.py:1-16 tensorrt_llm/__init__.py:1-50
TensorRT-LLM delivers optimized LLM inference through several key capabilities:
| Capability | Description | Key Techniques |
|---|---|---|
| Quantization | Reduce model memory footprint and increase throughput | FP8, FP4, NVFP4, INT8, INT4 AWQ, SmoothQuant |
| Distributed Execution | Scale to large models across multiple GPUs/nodes | Tensor Parallel (TP), Pipeline Parallel (PP), Expert Parallel (EP), Context Parallel (CP) |
| Speculative Decoding | Accelerate generation through multi-token prediction | Eagle3, MTP (Multi-Token Prediction), N-Gram |
| Attention Optimization | Efficient attention computation | FlashAttention, PagedAttention, Multi-head Latent Attention (MLA) |
| Kernel Optimization | Hardware-specific kernel selection and fusion | AutoTuner, CUDA Graphs, torch.compile integration |
| Memory Management | Efficient KV cache and memory allocation | Block reuse, chunked prefill, paged KV cache |
Sources: README.md:1-220 tests/integration/defs/accuracy/test_llm_api_pytorch.py:1-100
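The memory-management techniques above can be illustrated with a toy paged KV cache allocator. This is a minimal sketch in pure Python: the class name, block-table layout, and API are assumptions for illustration, not TensorRT-LLM's actual implementation.

```python
class PagedKVCache:
    """Toy paged KV cache: sequences receive fixed-size blocks on demand,
    so memory is allocated in pages rather than one contiguous span."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of unused block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append(self, seq_id: int) -> None:
        """Reserve space for one more token of a sequence's KV state."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(33):          # 33 tokens with block_size=16 -> 3 blocks
    cache.append(0)
```

Block reuse in the real system adds reference counting so identical prefixes can share blocks; this sketch shows only the paging aspect.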
Figure 1: High-Level System Architecture
The architecture spans several main layers:

- User interfaces: command-line tools (`trtllm-build`, `trtllm-bench`, `trtllm-serve`), the Python API (`tensorrt_llm.LLM`), and an OpenAI-compatible server
- Execution: `PyExecutor` coordinates request scheduling, resource allocation, and model execution

Sources: tensorrt_llm/__init__.py:1-50 README.md:1-100
Figure 2: Python API Entry Point
The `LLM` class in `tensorrt_llm` is the primary programmatic interface. Users instantiate it with a model path and optional configuration, then call `generate()` or `generate_async()` for inference.
Sources: tensorrt_llm/__init__.py:1-50 tests/integration/defs/accuracy/test_llm_api_pytorch.py:78-169
Figure 3: Command-Line Tool Hierarchy
Three main CLI tools provide access to TensorRT-LLM functionality:
- `trtllm-build`: converts model checkpoints to optimized TensorRT engines
- `trtllm-bench`: benchmarks model performance with dataset preparation and throughput/latency measurement
- `trtllm-serve`: launches an inference server with OpenAI-compatible API endpoints

Sources: tests/integration/defs/test_e2e.py:446-608 README.md:1-100
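A typical session with these tools might look like the sketch below. The flags, paths, and model name are illustrative assumptions rather than a verified invocation; consult each tool's `--help` output for the exact interface.

```shell
# Build an optimized TensorRT engine from a converted checkpoint
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine

# Measure throughput on a prepared dataset
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct throughput --dataset dataset.jsonl

# Serve the model behind an OpenAI-compatible HTTP API
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```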
TensorRT-LLM supports a wide range of modern LLM architectures:
| Model Family | Example Models | Special Features |
|---|---|---|
| Llama | Llama 3.1, 3.2, 3.3, 4 Maverick, 4 Scout | RoPE, GQA, SwiGLU |
| DeepSeek | DeepSeek-V3, DeepSeek-V3-Lite, DeepSeek-R1 | Multi-head Latent Attention, MoE |
| Qwen | Qwen3-8B, Qwen3-30B-A3B, Qwen3-235B-A22B | MoE, Eagle3 support |
| GPT-OSS | GPT-OSS-120B, GPT-OSS-20B | W4AFP8 quantization |
| Mixtral | Mixtral-8x7B | Sparse MoE |
| Gemma | Gemma-2, Gemma-3 | Multi-query attention |
| Mistral | Mistral-7B, Mistral-Large | Sliding window attention |
| Multimodal | Llama-3.2-11B-Vision, Qwen2-VL | Vision-language models |
Sources: tests/integration/defs/accuracy/references/gsm8k.yaml:1-100 tests/integration/defs/accuracy/references/mmlu.yaml:1-100 tests/integration/test_lists/test-db/l0_h100.yml:30-90
Figure 4: Quantization Algorithm Hierarchy
TensorRT-LLM supports multiple quantization algorithms, each optimized for particular hardware generations (see Figure 4).
Each algorithm can be combined with KV cache quantization for additional memory savings.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py:91-98 tests/integration/defs/accuracy/references/gsm8k.yaml:1-50
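To make the core weight-quantization idea concrete, the sketch below shows symmetric per-tensor INT8 quantization in pure Python: floats are mapped to 8-bit integers with a single scale factor. This is a minimal illustration only; production schemes such as AWQ and SmoothQuant add per-channel scales, activation-aware calibration, and other refinements.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale=0 for all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)        # integers in [-128, 127] plus one float scale
w_hat = dequantize_int8(q, s)  # close to w, within one quantization step
```

Storing `q` as int8 uses 4x less memory than float32 weights, at the cost of the rounding error visible in `w_hat`; KV cache quantization applies the same idea to attention state.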
Figure 5: Distributed Execution Patterns
TensorRT-LLM supports five parallelism dimensions that can be combined (see Figure 5), including tensor, pipeline, expert, and context parallelism.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py:173-191 tests/integration/test_lists/test-db/l0_dgx_h100.yml:119-128
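Composing parallelism dimensions amounts to mapping each global GPU rank to a coordinate in a parallelism grid. The toy mapping below assumes a TP-major layout (TP ranks contiguous within a pipeline stage), which is a common convention but not necessarily TensorRT-LLM's internal one.

```python
def rank_to_coords(rank: int, tp_size: int, pp_size: int):
    """Map a global rank to (pp_rank, tp_rank) under a TP-major layout."""
    assert 0 <= rank < tp_size * pp_size, "rank outside the parallelism grid"
    return rank // tp_size, rank % tp_size

# 8 GPUs as TP=4 x PP=2: ranks 0-3 form pipeline stage 0, ranks 4-7 stage 1
coords = [rank_to_coords(r, tp_size=4, pp_size=2) for r in range(8)]
```

Expert and context parallelism extend the same grid with additional axes; the product of all dimension sizes must equal the world size.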
Figure 6: Testing Framework Organization
The testing framework validates TensorRT-LLM across multiple dimensions (see Figure 6).
Test configurations are managed through YAML files that specify hardware requirements, GPU types, and test sharding.
Sources: tests/integration/test_lists/test-db/l0_h100.yml:1-50 tests/integration/defs/accuracy/test_llm_api_pytorch.py:1-100 tests/integration/test_lists/waives.txt:1-50
Figure 7: Request Execution Flow
A typical inference request flows through these stages:
1. `ExecutorRequestQueue` receives the incoming request
2. `RequestScheduler` selects requests based on available resources
3. `ResourceManager` allocates KV cache blocks and sequence slots
4. `ModelEngine` runs the model forward pass through attention and FFN layers
5. `Sampler` generates next tokens using the configured sampling strategy
6. The result is returned as a `RequestOutput`

Sources: tests/integration/defs/test_e2e.py:446-608 tests/integration/defs/accuracy/test_llm_api_pytorch.py:78-169
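The request pipeline can be sketched as a toy executor loop. The variable names echo the real components, but the scheduling policy and "model step" here are deliberate simplifications, not TensorRT-LLM's actual logic.

```python
from collections import deque

def run_executor(request_ids, max_batch=2, max_new_tokens=3):
    """Toy executor loop: queue -> schedule -> forward -> sample -> output."""
    queue = deque(request_ids)   # stands in for ExecutorRequestQueue
    active, outputs = [], {}
    while queue or active:
        # Scheduler: admit queued requests up to the batch limit
        while queue and len(active) < max_batch:
            active.append({"id": queue.popleft(), "tokens": []})
        # Model engine + sampler: one dummy decode step per active request
        for req in active:
            req["tokens"].append(len(req["tokens"]))  # stand-in for a sampled token
        # Finished requests leave the batch, freeing their slot for the queue
        for req in [r for r in active if len(r["tokens"]) >= max_new_tokens]:
            outputs[req["id"]] = req["tokens"]         # stands in for RequestOutput
            active.remove(req)
    return outputs
```

Because finished requests free their slot each iteration, waiting requests are admitted as capacity appears, mirroring the continuous-batching behavior the real scheduler provides.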
For typical usage, the workflow is:
1. Install TensorRT-LLM: see Installation and Dependencies for setup instructions.
2. Load a model using the Python API and call `generate()`.
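As a sketch of the second step (it assumes a working TensorRT-LLM installation and a supported GPU; the model name and sampling settings are illustrative):

```python
from tensorrt_llm import LLM, SamplingParams

# Instantiate the engine from a Hugging Face model ID or a local path
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Synchronous generation; generate_async() offers a streaming variant
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=32),
)
for output in outputs:
    print(output.outputs[0].text)
```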
For detailed configuration options and advanced features, refer to the specific subsystem documentation linked at the top of this page.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py:78-169 tests/integration/defs/test_e2e.py:446-608 README.md:1-100