Ollama Performance Tuning: Boost Your GPU Efficiency
Running large language models (LLMs) with Ollama in production environments demands careful GPU optimization to achieve maximum throughput, minimize latency, and control costs. This comprehensive guide explores advanced techniques for tuning Ollama’s GPU performance, from hardware configuration to runtime optimizations.
Understanding Ollama’s GPU Architecture
Ollama leverages GPU acceleration through llama.cpp’s CUDA and ROCm backends, dynamically managing VRAM allocation and compute resources. Before diving into optimization, it’s crucial to understand how Ollama utilizes GPU resources:
- Model Loading: Ollama loads model weights into VRAM, with partial offloading to system RAM when necessary
- Context Management: The KV cache stores attention key-value pairs, consuming VRAM proportional to context length
- Batch Processing: Multiple requests can be batched for improved throughput
- Layer Offloading: Individual transformer layers can be distributed between GPU and CPU
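You can check how this split actually lands for a loaded model by querying the server. Below is a minimal Python sketch, assuming the /api/ps endpoint available in recent Ollama releases (the same information ollama ps prints) and its size and size_vram response fields:

import requests

# Show how much of each loaded model sits in VRAM versus system RAM
resp = requests.get("http://localhost:11434/api/ps").json()
for m in resp.get("models", []):
    total = m["size"]
    vram = m.get("size_vram", 0)
    pct = 100 * vram / total if total else 0
    print(f"{m['name']}: {total / 2**30:.1f} GiB, {pct:.0f}% in VRAM")

A model reported at 100% in VRAM runs entirely on the GPU; anything less means layers have spilled to system RAM and token throughput will drop accordingly.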
GPU Memory Optimization Strategies
1. Configuring GPU Layer Offloading
The most impactful optimization is controlling how many model layers run on the GPU versus CPU. Use the num_gpu parameter to fine-tune this balance:
# Check available VRAM
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
# Set the GPU layer count for an interactive session
ollama run llama2:13b
>>> /set parameter num_gpu 35
# num_gpu can also be set per request in the API "options" field or in a Modelfile (see below)
For Modelfile configurations, specify GPU layers directly:
FROM llama2:13b
PARAMETER num_gpu 35
PARAMETER num_thread 8
2. VRAM Allocation and Context Windows
Context window size dramatically affects VRAM consumption because the KV cache grows linearly with context length. Estimate the requirement with this formula:
VRAM_needed ≈ model_weights + (2 × num_layers × context_length × hidden_dim × batch_size × bytes_per_element)
The leading factor of 2 accounts for the separate key and value tensors, and bytes_per_element is 2 for an fp16 cache; for models that use grouped-query attention, replace hidden_dim with num_kv_heads × head_dim.
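To make the formula concrete, here is a small Python helper. The layer count and hidden dimension below are illustrative values for a 13B-class model, and the estimate assumes an fp16 cache with no grouped-query attention or cache quantization:

def kv_cache_bytes(context_length, num_layers, hidden_dim,
                   batch_size=1, bytes_per_element=2):
    # The factor of 2 covers the separate key and value tensors
    return 2 * num_layers * context_length * hidden_dim * batch_size * bytes_per_element

# Illustrative 13B-class configuration: 40 layers, hidden dimension 5120, 4k context
estimate = kv_cache_bytes(context_length=4096, num_layers=40, hidden_dim=5120)
print(f"KV cache: ~{estimate / 2**30:.1f} GiB")  # roughly 3.1 GiB on top of the weights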
# Set context window via API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Your prompt here",
  "options": {
    "num_ctx": 4096,
    "num_batch": 512,
    "num_gpu": 40
  }
}'
3. Multi-GPU Configuration
For systems with multiple GPUs, Ollama automatically distributes layers across available devices. Control GPU selection with CUDA environment variables:
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1
ollama serve
# Check GPU utilization
watch -n 1 nvidia-smi
# Monitor per-GPU memory usage
nvidia-smi dmon -s mu
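To gather the same per-GPU numbers programmatically, a short pynvml sketch (the same NVML bindings used in the monitoring section later in this guide) can enumerate devices and report free memory and utilization:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older pynvml versions return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {i} ({name}): {mem.free / 2**30:.1f} GiB free, {util.gpu}% utilization")
pynvml.nvmlShutdown()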
Runtime Performance Optimization
Batch Processing Configuration
Batch size and request concurrency both affect throughput and latency: processing more tokens and requests at once improves GPU utilization, but individual responses can take longer:
# Increase request concurrency for higher aggregate throughput
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve
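To observe the tradeoff directly, the sketch below fires several requests at /api/generate concurrently and compares worst-case latency with aggregate token throughput. The model name, prompts, and worker count are placeholders; keep the worker count at or below OLLAMA_NUM_PARALLEL:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

def generate(prompt):
    # Time a single non-streaming generation and return (latency, generated tokens)
    start = time.time()
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2:13b", "prompt": prompt, "stream": False},
    ).json()
    return time.time() - start, r.get("eval_count", 0)

prompts = [f"Summarize topic {i} in two sentences." for i in range(8)]
wall = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:  # match OLLAMA_NUM_PARALLEL
    results = list(pool.map(generate, prompts))
wall = time.time() - wall

total_tokens = sum(tokens for _, tokens in results)
print(f"Worst-case latency: {max(lat for lat, _ in results):.1f}s")
print(f"Aggregate throughput: {total_tokens / wall:.1f} tokens/sec")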
Create a systemd service with optimized parameters:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_FLASH_ATTENTION=1"
[Install]
WantedBy=default.target
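After editing the unit file, run systemctl daemon-reload and then systemctl restart ollama so the new environment variables take effect.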
Flash Attention Enablement
Flash attention reduces the memory footprint of the attention computation and can speed up inference on supported GPUs (recent NVIDIA architectures benefit most):
# Enable Flash Attention
export OLLAMA_FLASH_ATTENTION=1
# Verify Flash Attention is active (check logs)
ollama serve 2>&1 | grep -i "flash"
Kubernetes Deployment Optimization
Deploying Ollama on Kubernetes requires careful resource allocation and GPU scheduling:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
        - name: OLLAMA_FLASH_ATTENTION
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
      nodeSelector:
        gpu-type: nvidia-a100
GPU Time-Slicing for Cost Optimization
Enable NVIDIA GPU time-slicing to share GPUs across multiple pods:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
        - name: nvidia.com/gpu
          replicas: 4
Monitoring and Profiling
Performance Metrics Collection
Implement comprehensive monitoring to identify bottlenecks:
import time

import requests
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def benchmark_ollama(model, prompt, iterations=10):
    metrics = []
    for i in range(iterations):
        # GPU memory before the request
        gpu_mem_before = pynvml.nvmlDeviceGetMemoryInfo(handle).used

        start_time = time.time()
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={
                'model': model,
                'prompt': prompt,
                'stream': False
            })
        end_time = time.time()

        # GPU memory and utilization after the request
        gpu_mem_after = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

        metrics.append({
            'latency': end_time - start_time,
            'gpu_memory_delta': gpu_mem_after - gpu_mem_before,
            'gpu_utilization': gpu_util,
            'tokens': response.json().get('eval_count', 0)
        })
    return metrics

# Run benchmark
results = benchmark_ollama('llama2:13b', 'Explain quantum computing', 10)
avg_latency = sum(r['latency'] for r in results) / len(results)
avg_tokens_per_sec = sum(r['tokens'] / r['latency'] for r in results) / len(results)
print(f"Average Latency: {avg_latency:.2f}s")
print(f"Average Throughput: {avg_tokens_per_sec:.2f} tokens/sec")
Profiling with NVIDIA Nsight
# Profile the Ollama server with Nsight Systems (the CUDA kernels run in the
# server process, not in the ollama run client)
nsys profile --trace=cuda,nvtx --output=ollama-profile ollama serve
# ...then trigger a generation from a second terminal:
ollama run llama2:13b "Explain machine learning"

# Analyze CUDA kernels with Nsight Compute
ncu --set full --target-processes all ollama serve
# ...again driving the request from the second terminal:
ollama run llama2:13b "Test prompt"
Troubleshooting Common GPU Issues
Out of Memory (OOM) Errors
When encountering CUDA OOM errors, systematically reduce resource consumption:
# 1. Reduce the context window for the request
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "test",
  "options": {"num_ctx": 2048}
}'

# 2. Decrease the number of GPU layers via the num_gpu option
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "test",
  "options": {"num_gpu": 30}
}'

# 3. Unload the model to free its VRAM
ollama stop <model-name>

# 4. As a last resort, reset the GPU (requires root and no active processes on it)
sudo nvidia-smi --gpu-reset
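If OOM failures are intermittent, it can also help to build the retry into the client. The sketch below is a hypothetical fallback loop, not an Ollama-specific API: the option values are illustrative, and it simply retries with a smaller context window and fewer GPU layers whenever the server returns an error response:

import requests

# Progressively smaller configurations to try after a failure (illustrative values)
FALLBACKS = [
    {"num_ctx": 4096, "num_gpu": 40},
    {"num_ctx": 2048, "num_gpu": 30},
    {"num_ctx": 1024, "num_gpu": 20},
]

def generate_with_fallback(model, prompt):
    for options in FALLBACKS:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False, "options": options},
        )
        if resp.ok:
            return resp.json().get("response", "")
        print(f"Generation failed with {options}: {resp.text[:200]}")
    raise RuntimeError("All fallback configurations failed")

print(generate_with_fallback("llama2:13b", "test"))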
Low GPU Utilization
If GPU utilization remains below 80%, investigate these factors:
- CPU Bottleneck: Increase the num_thread parameter
- Small Batch Size: Increase OLLAMA_NUM_PARALLEL
- I/O Wait: Use faster storage (NVMe) for model files
- Network Latency: Optimize request batching
# Diagnose bottlenecks
ollama serve &
PID=$!
# Monitor CPU usage
perf stat -p $PID -e cycles,instructions,cache-misses
# Check I/O wait
iostat -x 1
Advanced Optimization Techniques
Quantization Strategies
Use quantized models to reduce VRAM requirements while maintaining acceptable quality:
# Pull quantized versions
ollama pull llama2:13b-q4_K_M # 4-bit quantization
ollama pull llama2:13b-q5_K_M # 5-bit quantization
# Compare performance
for quant in q4_K_M q5_K_M q8_0; do
echo "Testing llama2:13b-$quant"
time ollama run llama2:13b-$quant "Write a haiku"
done
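For a comparison that excludes model load time, read the eval_count and eval_duration fields returned by the non-streaming /api/generate endpoint instead of timing the CLI. The tags below mirror the pulls above; adjust them to whatever you actually have locally:

import requests

TAGS = ["llama2:13b-q4_K_M", "llama2:13b-q5_K_M", "llama2:13b-q8_0"]

for tag in TAGS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": tag, "prompt": "Write a haiku", "stream": False},
        timeout=600,
    ).json()
    # eval_duration is reported in nanoseconds
    tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{tag}: {tok_per_sec:.1f} tokens/sec decode")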
Custom CUDA Kernel Optimization
For advanced users, compile Ollama with custom CUDA architectures:
# Clone and build with specific CUDA compute capability
git clone https://github.com/ollama/ollama.git
cd ollama
# Build for a specific GPU architecture (e.g., A100 = compute capability 8.0)
export CMAKE_CUDA_ARCHITECTURES="80"
go generate ./...
go build .
Best Practices Summary
- Right-size your models: Use the smallest model that meets quality requirements
- Monitor continuously: Track GPU utilization, memory usage, and latency
- Tune iteratively: Adjust parameters based on actual workload patterns
- Use quantization: Q4 or Q5 models offer excellent quality/performance tradeoffs
- Enable Flash Attention: Free performance boost on modern GPUs
- Batch requests: Maximize throughput by processing multiple requests simultaneously
- Profile before optimizing: Use profiling tools to identify actual bottlenecks
Conclusion
Optimizing Ollama’s GPU performance requires a holistic approach combining hardware configuration, runtime parameters, and workload characteristics. By implementing these techniques, from layer offloading and batch processing to flash attention and quantization, you can often achieve 2-3x performance improvements while reducing infrastructure costs. Start with the low-hanging fruit (GPU layer count, context windows), measure the impact, and progressively implement advanced optimizations based on your specific use case.
Remember that optimal configuration varies by model size, GPU type, and workload pattern. Continuous monitoring and iterative tuning are essential for maintaining peak performance in production environments.