TensorRT-LLM is a high-performance inference library for Large Language Models (LLMs) on NVIDIA GPUs. It provides Python APIs, command-line tools, and serving infrastructure to optimize and deploy LLMs with state-of-the-art performance through model quantization, distributed execution strategies, and advanced inference techniques.
This document provides a high-level overview of TensorRT-LLM's architecture, capabilities, and components. Detailed information about specific subsystems is covered in their dedicated pages.
Sources: README.md:1-220 tensorrt_llm/version.py:1-16 tensorrt_llm/__init__.py:1-50
TensorRT-LLM delivers optimized LLM inference through several key capabilities:
| Capability | Description | Key Techniques |
|---|---|---|
| Quantization | Reduce model memory footprint and increase throughput | FP8, FP4, NVFP4, INT8, INT4 AWQ, SmoothQuant |
| Distributed Execution | Scale to large models across multiple GPUs/nodes | Tensor Parallel (TP), Pipeline Parallel (PP), Expert Parallel (EP), Context Parallel (CP) |
| Speculative Decoding | Accelerate generation through multi-token prediction | Eagle3, MTP (Multi-Token Prediction), N-Gram |
| Attention Optimization | Efficient attention computation | FlashAttention, PagedAttention, Multi-head Latent Attention (MLA) |
| Kernel Optimization | Hardware-specific kernel selection and fusion | AutoTuner, CUDA Graphs, torch.compile integration |
| Memory Management | Efficient KV cache and memory allocation | Block reuse, chunked prefill, paged KV cache |
Sources: README.md:1-220 tests/integration/defs/accuracy/test_llm_api_pytorch.py:1-100
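The memory-management techniques above can be illustrated with a toy paged KV cache allocator. This is a minimal sketch in pure Python: the class name, block-table layout, and API are assumptions for illustration, not TensorRT-LLM's actual implementation.

```python
class PagedKVCache:
    """Toy paged KV cache: sequences receive fixed-size blocks on demand,
    so memory is allocated in pages rather than one contiguous span."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of unused block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append(self, seq_id: int) -> None:
        """Reserve space for one more token of a sequence's KV state."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(33):          # 33 tokens with block_size=16 -> 3 blocks
    cache.append(0)
```

Block reuse in the real system adds reference counting so identical prefixes can share blocks; this sketch shows only the paging aspect.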
Figure 1: High-Level System Architecture
The architecture spans several main layers:

- User interfaces: command-line tools (`trtllm-build`, `trtllm-bench`, `trtllm-serve`), the Python API (`tensorrt_llm.LLM`), and an OpenAI-compatible server
- Execution: `PyExecutor` coordinates request scheduling, resource allocation, and model execution

Sources: tensorrt_llm/__init__.py:1-50 README.md:1-100
Figure 2: Python API Entry Point
The `LLM` class in `tensorrt_llm` is the primary programmatic interface. Users instantiate it with a model path and optional configuration, then call `generate()` or `generate_async()` for inference.
Sources: tensorrt_llm/__init__.py:1-50 tests/integration/defs/accuracy/test_llm_api_pytorch.py:78-169
Figure 3: Command-Line Tool Hierarchy
Three main CLI tools provide access to TensorRT-LLM functionality:
- `trtllm-build`: converts model checkpoints to optimized TensorRT engines
- `trtllm-bench`: benchmarks model performance with dataset preparation and throughput/latency measurement
- `trtllm-serve`: launches an inference server with OpenAI-compatible API endpoints

Sources: tests/integration/defs/test_e2e.py:446-608 README.md:1-100
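A typical session with these tools might look like the sketch below. The flags, paths, and model name are illustrative assumptions rather than a verified invocation; consult each tool's `--help` output for the exact interface.

```shell
# Build an optimized TensorRT engine from a converted checkpoint
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine

# Measure throughput on a prepared dataset
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct throughput --dataset dataset.jsonl

# Serve the model behind an OpenAI-compatible HTTP API
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```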
TensorRT-LLM supports a wide range of modern LLM architectures:
| Model Family | Example Models | Special Features |
|---|---|---|
| Llama | Llama 3.1, 3.2, 3.3, 4 Maverick, 4 Scout | RoPE, GQA, SwiGLU |
| DeepSeek | DeepSeek-V3, DeepSeek-V3-Lite, DeepSeek-R1 | Multi-head Latent Attention, MoE |
| Qwen | Qwen3-8B, Qwen3-30B-A3B, Qwen3-235B-A22B | MoE, Eagle3 support |
| GPT-OSS | GPT-OSS-120B, GPT-OSS-20B | W4AFP8 quantization |
| Mixtral | Mixtral-8x7B | Sparse MoE |
| Gemma | Gemma-2, Gemma-3 | Multi-query attention |
| Mistral | Mistral-7B, Mistral-Large | Sliding window attention |
| Multimodal | Llama-3.2-11B-Vision, Qwen2-VL | Vision-language models |
Sources: tests/integration/defs/accuracy/references/gsm8k.yaml:1-100 tests/integration/defs/accuracy/references/mmlu.yaml:1-100 tests/integration/test_lists/test-db/l0_h100.yml:30-90
Figure 4: Quantization Algorithm Hierarchy
TensorRT-LLM supports multiple quantization algorithms, each optimized for particular hardware generations (see Figure 4).
Each algorithm can be combined with KV cache quantization for additional memory savings.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py:91-98 tests/integration/defs/accuracy/references/gsm8k.yaml:1-50
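To make the core weight-quantization idea concrete, the sketch below shows symmetric per-tensor INT8 quantization in pure Python: floats are mapped to 8-bit integers with a single scale factor. This is a minimal illustration only; production schemes such as AWQ and SmoothQuant add per-channel scales, activation-aware calibration, and other refinements.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale=0 for all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)        # integers in [-128, 127] plus one float scale
w_hat = dequantize_int8(q, s)  # close to w, within one quantization step
```

Storing `q` as int8 uses 4x less memory than float32 weights, at the cost of the rounding error visible in `w_hat`; KV cache quantization applies the same idea to attention state.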
Figure 5: Distributed Execution Patterns
TensorRT-LLM supports five parallelism dimensions that can be combined (see Figure 5), including tensor, pipeline, expert, and context parallelism.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py:173-191 tests/integration/test_lists/test-db/l0_dgx_h100.yml:119-128
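Composing parallelism dimensions amounts to mapping each global GPU rank to a coordinate in a parallelism grid. The toy mapping below assumes a TP-major layout (TP ranks contiguous within a pipeline stage), which is a common convention but not necessarily TensorRT-LLM's internal one.

```python
def rank_to_coords(rank: int, tp_size: int, pp_size: int):
    """Map a global rank to (pp_rank, tp_rank) under a TP-major layout."""
    assert 0 <= rank < tp_size * pp_size, "rank outside the parallelism grid"
    return rank // tp_size, rank % tp_size

# 8 GPUs as TP=4 x PP=2: ranks 0-3 form pipeline stage 0, ranks 4-7 stage 1
coords = [rank_to_coords(r, tp_size=4, pp_size=2) for r in range(8)]
```

Expert and context parallelism extend the same grid with additional axes; the product of all dimension sizes must equal the world size.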
Figure 6: Testing Framework Organization
The testing framework validates TensorRT-LLM across multiple dimensions (see Figure 6).
Test configurations are managed through YAML files that specify hardware requirements, GPU types, and test sharding.
Sources: tests/integration/test_lists/test-db/l0_h100.yml:1-50 tests/integration/defs/accuracy/test_llm_api_pytorch.py:1-100 tests/integration/test_lists/waives.txt:1-50
Figure 7: Request Execution Flow
A typical inference request flows through these stages:
1. `ExecutorRequestQueue` receives the incoming request
2. `RequestScheduler` selects requests based on available resources
3. `ResourceManager` allocates KV cache blocks and sequence slots
4. `ModelEngine` runs the model forward pass through attention and FFN layers
5. `Sampler` generates next tokens using the configured sampling strategy
6. The result is returned as a `RequestOutput`

Sources: tests/integration/defs/test_e2e.py:446-608 tests/integration/defs/accuracy/test_llm_api_pytorch.py:78-169
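The request pipeline can be sketched as a toy executor loop. The variable names echo the real components, but the scheduling policy and "model step" here are deliberate simplifications, not TensorRT-LLM's actual logic.

```python
from collections import deque

def run_executor(request_ids, max_batch=2, max_new_tokens=3):
    """Toy executor loop: queue -> schedule -> forward -> sample -> output."""
    queue = deque(request_ids)   # stands in for ExecutorRequestQueue
    active, outputs = [], {}
    while queue or active:
        # Scheduler: admit queued requests up to the batch limit
        while queue and len(active) < max_batch:
            active.append({"id": queue.popleft(), "tokens": []})
        # Model engine + sampler: one dummy decode step per active request
        for req in active:
            req["tokens"].append(len(req["tokens"]))  # stand-in for a sampled token
        # Finished requests leave the batch, freeing their slot for the queue
        for req in [r for r in active if len(r["tokens"]) >= max_new_tokens]:
            outputs[req["id"]] = req["tokens"]         # stands in for RequestOutput
            active.remove(req)
    return outputs
```

Because finished requests free their slot each iteration, waiting requests are admitted as capacity appears, mirroring the continuous-batching behavior the real scheduler provides.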
For typical usage, the workflow is:
1. Install TensorRT-LLM: see Installation and Dependencies for setup instructions.
2. Load a model using the Python API and call `generate()`.
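As a sketch of the second step (it assumes a working TensorRT-LLM installation and a supported GPU; the model name and sampling settings are illustrative):

```python
from tensorrt_llm import LLM, SamplingParams

# Instantiate the engine from a Hugging Face model ID or a local path
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Synchronous generation; generate_async() offers a streaming variant
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=32),
)
for output in outputs:
    print(output.outputs[0].text)
```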
For detailed configuration options and advanced features, refer to the specific subsystem documentation linked at the top of this page.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py:78-169 tests/integration/defs/test_e2e.py:446-608 README.md:1-100