Ollama Performance Tuning: Boost Your GPU Efficiency
Running large language models (LLMs) with Ollama in production environments demands careful GPU optimization to achieve maximum throughput, minimize latency, and control costs. This comprehensive guide explores advanced techniques for tuning Ollama’s GPU performance, from hardware configuration to runtime optimizations.
Understanding Ollama’s GPU Architecture
Ollama leverages GPU acceleration through llama.cpp’s CUDA and ROCm backends, dynamically managing VRAM allocation and compute resources. Before diving into optimization, it’s crucial to understand how Ollama utilizes GPU resources:
- Model Loading: Ollama loads model weights into VRAM, with partial offloading to system RAM when necessary
- Context Management: The KV cache stores attention key-value pairs, consuming VRAM proportional to context length
- Batch Processing: Multiple requests can be batched for improved throughput
- Layer Offloading: Individual transformer layers can be distributed between GPU and CPU
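You can check how this split actually lands for a loaded model by querying the server. Below is a minimal Python sketch, assuming the /api/ps endpoint available in recent Ollama releases (the same information ollama ps prints) and its size and size_vram response fields:

import requests

# Show how much of each loaded model sits in VRAM versus system RAM
resp = requests.get("http://localhost:11434/api/ps").json()
for m in resp.get("models", []):
    total = m["size"]
    vram = m.get("size_vram", 0)
    pct = 100 * vram / total if total else 0
    print(f"{m['name']}: {total / 2**30:.1f} GiB, {pct:.0f}% in VRAM")

A model reported at 100% in VRAM runs entirely on the GPU; anything less means layers have spilled to system RAM and token throughput will drop accordingly.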
GPU Memory Optimization Strategies
1. Configuring GPU Layer Offloading
The most impactful optimization is controlling how many model layers run on the GPU versus CPU. Use the num_gpu parameter to fine-tune this balance:
# Check available VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
# Set the GPU layer count for an interactive session
ollama run llama2:13b
>>> /set parameter num_gpu 35
# num_gpu can also be set per request in the API "options" field or in a Modelfile (see below)
For Modelfile configurations, specify GPU layers directly:
FROM llama2:13b
PARAMETER num_gpu 35
PARAMETER num_thread 8
2. VRAM Allocation and Context Windows
Context window size dramatically affects VRAM consumption because the KV cache grows linearly with context length. Estimate the requirement with this formula:
VRAM_needed ≈ model_weights + (2 × num_layers × context_length × hidden_dim × batch_size × bytes_per_element)
The leading factor of 2 accounts for the separate key and value tensors, and bytes_per_element is 2 for an fp16 cache; for models that use grouped-query attention, replace hidden_dim with num_kv_heads × head_dim.
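To make the formula concrete, here is a small Python helper. The layer count and hidden dimension below are illustrative values for a 13B-class model, and the estimate assumes an fp16 cache with no grouped-query attention or cache quantization:

def kv_cache_bytes(context_length, num_layers, hidden_dim,
                   batch_size=1, bytes_per_element=2):
    # The factor of 2 covers the separate key and value tensors
    return 2 * num_layers * context_length * hidden_dim * batch_size * bytes_per_element

# Illustrative 13B-class configuration: 40 layers, hidden dimension 5120, 4k context
estimate = kv_cache_bytes(context_length=4096, num_layers=40, hidden_dim=5120)
print(f"KV cache: ~{estimate / 2**30:.1f} GiB")  # roughly 3.1 GiB on top of the weights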
# Set context window via API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Your prompt here",
  "options": {
    "num_ctx": 4096,
    "num_batch": 512,
    "num_gpu": 40
  }
}'
3. Multi-GPU Configuration
For systems with multiple GPUs, Ollama automatically distributes layers across available devices. Control GPU selection with CUDA environment variables:
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1
ollama serve
# Check GPU utilization
watch -n 1 nvidia-smi
# Monitor per-GPU memory usage
nvidia-smi dmon -s mu
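To gather the same per-GPU numbers programmatically, a short pynvml sketch (the same NVML bindings used in the monitoring section later in this guide) can enumerate devices and report free memory and utilization:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older pynvml versions return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {i} ({name}): {mem.free / 2**30:.1f} GiB free, {util.gpu}% utilization")
pynvml.nvmlShutdown()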
Runtime Performance Optimization
Batch Processing Configuration
Batch size and request concurrency both affect throughput and latency: processing more tokens and requests at once improves GPU utilization, but individual responses can take longer:
# Increase request concurrency for higher aggregate throughput
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve
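To observe the tradeoff directly, the sketch below fires several requests at /api/generate concurrently and compares worst-case latency with aggregate token throughput. The model name, prompts, and worker count are placeholders; keep the worker count at or below OLLAMA_NUM_PARALLEL:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

def generate(prompt):
    # Time a single non-streaming generation and return (latency, generated tokens)
    start = time.time()
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2:13b", "prompt": prompt, "stream": False},
    ).json()
    return time.time() - start, r.get("eval_count", 0)

prompts = [f"Summarize topic {i} in two sentences." for i in range(8)]
wall = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # match OLLAMA_NUM_PARALLEL
    results = list(pool.map(generate, prompts))
wall = time.time() - wall

total_tokens = sum(tokens for _, tokens in results)
print(f"Worst-case latency: {max(lat for lat, _ in results):.1f}s")
print(f"Aggregate throughput: {total_tokens / wall:.1f} tokens/sec")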
Create a systemd service with optimized parameters:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_FLASH_ATTENTION=1"
[Install]
WantedBy=default.target
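After editing the unit file, run systemctl daemon-reload and then systemctl restart ollama so the new environment variables take effect.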
Flash Attention Enablement
Flash attention reduces the memory footprint of the attention computation and can speed up inference on supported GPUs (recent NVIDIA architectures benefit most):
# Enable Flash Attention
export OLLAMA_FLASH_ATTENTION=1
# Verify Flash Attention is active (check logs)
ollama serve 2>&1 | grep -i "flash"
Kubernetes Deployment Optimization
Deploying Ollama on Kubernetes requires careful resource allocation and GPU scheduling:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
        - name: OLLAMA_FLASH_ATTENTION
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
      nodeSelector:
        gpu-type: nvidia-a100
GPU Time-Slicing for Cost Optimization
Enable NVIDIA GPU time-slicing to share GPUs across multiple pods:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
        - name: nvidia.com/gpu
          replicas: 4
Monitoring and Profiling
Performance Metrics Collection
Implement comprehensive monitoring to identify bottlenecks:
import time

import requests
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def benchmark_ollama(model, prompt, iterations=10):
    metrics = []
    for i in range(iterations):
        # GPU memory before the request
        gpu_mem_before = pynvml.nvmlDeviceGetMemoryInfo(handle).used

        start_time = time.time()
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={
                'model': model,
                'prompt': prompt,
                'stream': False
            })
        end_time = time.time()

        # GPU memory and utilization after the request
        gpu_mem_after = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

        metrics.append({
            'latency': end_time - start_time,
            'gpu_memory_delta': gpu_mem_after - gpu_mem_before,
            'gpu_utilization': gpu_util,
            'tokens': response.json().get('eval_count', 0)
        })
    return metrics

# Run benchmark
results = benchmark_ollama('llama2:13b', 'Explain quantum computing', 10)
avg_latency = sum(r['latency'] for r in results) / len(results)
avg_tokens_per_sec = sum(r['tokens'] / r['latency'] for r in results) / len(results)
print(f"Average Latency: {avg_latency:.2f}s")
print(f"Average Throughput: {avg_tokens_per_sec:.2f} tokens/sec")
Profiling with NVIDIA Nsight
# Profile the Ollama server with Nsight Systems (the CUDA kernels run in the
# server process, not in the ollama run client)
nsys profile --trace=cuda,nvtx --output=ollama-profile ollama serve
# ...then trigger a generation from a second terminal:
ollama run llama2:13b "Explain machine learning"

# Analyze CUDA kernels with Nsight Compute
ncu --set full --target-processes all ollama serve
# ...again driving the request from the second terminal:
ollama run llama2:13b "Test prompt"
Troubleshooting Common GPU Issues
Out of Memory (OOM) Errors
When encountering CUDA OOM errors, systematically reduce resource consumption:
# 1. Reduce the context window for the request
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "test",
  "options": {"num_ctx": 2048}
}'

# 2. Decrease the number of GPU layers via the num_gpu option
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "test",
  "options": {"num_gpu": 30}
}'

# 3. Unload the model to free its VRAM
ollama stop <model-name>

# 4. As a last resort, reset the GPU (requires root and no active processes on it)
sudo nvidia-smi --gpu-reset
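If OOM failures are intermittent, it can also help to build the retry into the client. The sketch below is a hypothetical fallback loop, not an Ollama-specific API: the option values are illustrative, and it simply retries with a smaller context window and fewer GPU layers whenever the server returns an error response:

import requests

# Progressively smaller configurations to try after a failure (illustrative values)
FALLBACKS = [
    {"num_ctx": 4096, "num_gpu": 40},
    {"num_ctx": 2048, "num_gpu": 30},
    {"num_ctx": 1024, "num_gpu": 20},
]

def generate_with_fallback(model, prompt):
    for options in FALLBACKS:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False, "options": options},
        )
        if resp.ok:
            return resp.json().get("response", "")
        print(f"Generation failed with {options}: {resp.text[:200]}")
    raise RuntimeError("All fallback configurations failed")

print(generate_with_fallback("llama2:13b", "test"))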
Low GPU Utilization
If GPU utilization remains below 80%, investigate these factors:
- CPU Bottleneck: Increase the num_thread parameter
- Small Batch Size: Increase OLLAMA_NUM_PARALLEL
- I/O Wait: Use faster storage (NVMe) for model files
- Network Latency: Optimize request batching
# Diagnose bottlenecks
ollama serve &
PID=$!
# Monitor CPU usage
perf stat -p $PID -e cycles,instructions,cache-misses
# Check I/O wait
iostat -x 1
Advanced Optimization Techniques
Quantization Strategies
Use quantized models to reduce VRAM requirements while maintaining acceptable quality:
# Pull quantized versions
ollama pull llama2:13b-q4_K_M # 4-bit quantization
ollama pull llama2:13b-q5_K_M # 5-bit quantization
# Compare performance
for quant in q4_K_M q5_K_M q8_0; do
echo "Testing llama2:13b-$quant"
time ollama run llama2:13b-$quant "Write a haiku"
done
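For a comparison that excludes model load time, read the eval_count and eval_duration fields returned by the non-streaming /api/generate endpoint instead of timing the CLI. The tags below mirror the pulls above; adjust them to whatever you actually have locally:

import requests

TAGS = ["llama2:13b-q4_K_M", "llama2:13b-q5_K_M", "llama2:13b-q8_0"]

for tag in TAGS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": tag, "prompt": "Write a haiku", "stream": False},
        timeout=600,
    ).json()
    # eval_duration is reported in nanoseconds
    tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{tag}: {tok_per_sec:.1f} tokens/sec decode")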
Custom CUDA Kernel Optimization
For advanced users, compile Ollama with custom CUDA architectures:
# Clone and build with specific CUDA compute capability
git clone https://github.com/ollama/ollama.git
cd ollama
# Build for a specific GPU architecture (e.g., A100 = compute capability 8.0)
export CMAKE_CUDA_ARCHITECTURES="80"
go generate ./...
go build .
Best Practices Summary
- Right-size your models: Use the smallest model that meets quality requirements
- Monitor continuously: Track GPU utilization, memory usage, and latency
- Tune iteratively: Adjust parameters based on actual workload patterns
- Use quantization: Q4 or Q5 models offer excellent quality/performance tradeoffs
- Enable Flash Attention: Free performance boost on modern GPUs
- Batch requests: Maximize throughput by processing multiple requests simultaneously
- Profile before optimizing: Use profiling tools to identify actual bottlenecks
Conclusion
Optimizing Ollama’s GPU performance requires a holistic approach combining hardware configuration, runtime parameters, and workload characteristics. By implementing these techniques, from layer offloading and batch processing to flash attention and quantization, you can often achieve 2-3x performance improvements while reducing infrastructure costs. Start with the low-hanging fruit (GPU layer count, context windows), measure the impact, and progressively implement advanced optimizations based on your specific use case.
Remember that optimal configuration varies by model size, GPU type, and workload pattern. Continuous monitoring and iterative tuning are essential for maintaining peak performance in production environments.