Skip to content

akandr/bc250

Repository files navigation

 ██████╗  ██████╗       ██████╗ ███████╗ ██████╗
 ██╔══██╗██╔════╝       ╚════██╗██╔════╝██╔═████╗
 ██████╔╝██║      █████╗ █████╔╝███████╗██║██╔██║
 ██╔══██╗██║      ╚════╝██╔═══╝ ╚════██║████╔╝██║
 ██████╔╝╚██████╗       ███████╗███████║╚██████╔╝
 ╚═════╝  ╚═════╝       ╚══════╝╚══════╝ ╚═════╝

GPU-accelerated AI home server on an obscure AMD APU — Vulkan inference, autonomous intelligence, Signal chat

Zen 2 · GFX1013 ("RDNA 1.5", informal) · 16 GB unified · Vulkan · 35B MoE @ 37.5 tok/s · 256K alloc / 32K practical filled ctx · 330 autonomous jobs/cycle · 130 dashboard pages

Code: AGPL v3 Docs: CC BY-SA 4.0

BC-250 test platform

The BC-250 powered by an ATX supply, cooled by a broken AIO radiator with 3 fans just sitting on top of it. Somehow runs 24/7 without issues so far.

A complete guide to running a 35-billion-parameter language model (Mixture-of-Experts architecture), FLUX.2 image generation, and 330 autonomous jobs on the AMD BC-250 — a crypto-mining board built around AMD's Cyan Skillfish APU (Zen 2 + GFX1013 GPU, 16 GB GDDR6), often associated by the community with the PS5's silicon lineage (Phoronix, LLVM AMDGPU), repurposed as a headless AI server with a community-patched BIOS.

35B MoE at 37.5 tok/s (tokens/second) with a 256K allocation ceiling and 32K practical filled context (64K allocation default), FLUX.2-klein-9B as the preferred image model from side-by-side testing, hardware-specific driver workarounds, memory tuning notes, and real-world benchmarks on this niche hardware. If you're new to LLM terminology, see the glossary below.

What makes this unusual: This document describes one public, real-world LLM inference deployment on BC-250 / GFX1013 hardware — GFX10-era silicon informally called "RDNA 1.5" by the community. ROCm's userspace libraries don't ship GFX1013 support. OpenCL/rusticl was not functional in this configuration. On this Fedora 43 / Mesa 25.3.4 stack, Vulkan was the only GPU compute path that proved usable — and even that required working around two kernel memory bottlenecks (GTT cap + TTM pages_limit) before 14B models would run.

Disclaimer: Unless otherwise stated, performance figures in this document are local measurements from one BC-250 board running Fedora 43, Mesa 25.3.4, and Ollama 0.18.0 with specific model quantizations. They are not vendor benchmarks and may not be reproducible on different software stacks.

Quick glossary — LLM inference terms used in this document
Term What it means
LLM Large Language Model — a neural network trained on text that generates responses token by token. Think of it as a stateless function: prompt in, text out.
Token The basic unit LLMs operate on. Roughly ¾ of a word in English. "Hello world" ≈ 2 tokens.
tok/s Tokens per second — the generation throughput. Higher = faster responses.
Parameters (3B, 14B, 35B) The number of trained weights in the model. More parameters generally means better quality but more memory and slower inference. A 14B model has 14 billion floating-point weights.
Quantization (Q4_0, IQ2_M, Q4_K_M) Compressing model weights from 16-bit floats to fewer bits. Q4 = 4 bits per weight (~4× smaller). IQ2_M ≈ 2.5 bits (~6× smaller). Trades precision for memory — like choosing between float32 and int8 for a DSP pipeline.
GGUF File format for quantized models (from llama.cpp). Contains weights + metadata. Analogous to a firmware binary with embedded config.
Context window / context length How many tokens the model can "see" at once (prompt + response). A 64K context = ~48K words. The model has no memory between calls — everything must fit in this window.
KV cache Key-Value cache — working memory allocated during inference to store attention state for each token in the context. Grows linearly with context length. This is the main VRAM consumer beyond model weights.
Prefill The phase where the model processes your entire prompt before generating the first output token. Speed measured in tok/s. Often compute-heavy at short prompts; at larger contexts, memory traffic becomes a major limiter.
Generation The phase where the model produces output tokens one at a time. Each new token requires reading all model weights once. Bottlenecked by memory bandwidth × parameter count.
TTFT Time To First Token — wall-clock delay from sending a prompt to receiving the first output token. Includes model load time (if cold) + prefill time.
MoE (Mixture of Experts) Architecture where only a subset of parameters activate per token. A 35B MoE with 3B active means 35B total weights in memory, but only 3B are used for each token's computation — faster than a 35B dense model, with quality closer to 35B than 3B.
Dense model A standard model where all parameters activate for every token. A 14B dense model does 14B operations per token.
Ollama Local LLM inference server. Wraps llama.cpp with an HTTP API. Manages model loading, KV cache, and GPU offload.
Think mode / thinking tokens Some models (DeepSeek-R1, Qwen3) generate internal reasoning tokens before the visible answer. These consume the output budget and context window but aren't shown to the user.

░░ Contents

§ Section What you'll find
PART I ─ HARDWARE & SETUP
1 Hardware Overview Specs, memory architecture, power
2 Driver & Compute Stack What works (Vulkan), what doesn't (ROCm)
3 Ollama + Vulkan Setup Install, GPU memory tuning (GTT + TTM)
4 Models & Benchmarks Model compatibility, speed, memory budget
4.10 ↳ Ollama vs llama.cpp TG: +45% Qwen MoE, +7% dense; 32K+ only via Ollama
PART II ─ AI STACK
5 Signal Chat Bot Chat, vision analysis, audio transcription, smart routing
6 Image Generation FLUX.2-klein-9B, synchronous pipeline
PART III ─ MONITORING & INTEL
7 Netscan Ecosystem 330 jobs, queue-runner v7, 130-page dashboard
8 Career Intelligence Two-phase scanner, salary, patents
PART IV ─ COMPREHENSIVE BENCHMARKS
B1 Methodology 5-phase suite, prompt standardization, scoring criteria
B2 Statistical Validation CV < 1.5%, single-run reliability proof
B3 Generation Speed tok/s, prefill, TTFT, VRAM (31 of 33 models)
B4 Quality Assessment 5 tasks × 3 runs, per-task breakdown, tier analysis
B5 Context Scaling Filled-context sweep, degradation, ceiling grid
B6 Long-Context Quality Fact retrieval, multi-hop reasoning, synthesis @ 16K+32K
B7 Cold-Start Timing TTFT, load speed, Signal chat latency profile
B8 Quantization Impact Q4_K_M vs Q8_0 comparison
B9 Image Generation 8 models, resolution scaling, video, upscaling
B10 Model Recommendations Best model per use case
PART V ─ REFERENCE
9 Repository Structure File layout, deployment paths
10 Troubleshooting Common issues and fixes
11 Known Limitations What's broken, what to watch out for
12 Software Versions Pinned versions of all components
13 References Links to all upstream projects and models
A OpenClaw Archive Original architecture, why it was ditched

PART I — Hardware & Setup

1. Hardware Overview

The AMD BC-250 is a crypto-mining board built by ASRock Rack around AMD's Cyan Skillfish APU — Zen 2 CPU (6c/12t) and GFX1013 GPU (24 CUs) with 16 GB GDDR6 unified memory. The Cyan Skillfish silicon is widely associated with the same hardware family as Sony's PS5 APU (Oberon), and a common community theory is that these are salvaged/binned PS5 dies that didn't meet Sony's specs. This is plausible but not publicly confirmed by AMD — treat it as informed speculation, not established fact. Based on reseller listings and community discussion, these boards were deployed in multi-board rack mining systems by ASRock Rack. After the racks were decommissioned, individual boards became available on AliExpress.

GFX1013 vs PS5: The PS5's Oberon is RDNA 2 (GFX10.3, gfx1030+). For practical purposes, the BC-250's Cyan Skillfish (gfx1013) behaves like a GFX10.1-era variant with fewer CUs than a full PS5 APU and an older ISA — though exact die-level comparisons are speculative without official AMD documentation. Unusually for GFX10.1, it retains hardware ray tracing extensions (VK_KHR_ray_tracing_pipeline, VK_KHR_ray_query). The community label "RDNA 1.5" (used throughout this document) reflects this hybrid positioning: GFX10.1 instruction set with ray tracing hardware more typical of RDNA 2. This is informal shorthand — not an official AMD designation.

BIOS is not stock. The board ships with a factory BIOS for rack operation that already includes UEFI boot and fan control. A community-patched BIOS (from AMD BC-250 docs) unlocks dynamic VRAM allocation (the 512 MB setting), custom VRAM splits, and chipset configuration menus.

Component Details
CPU Zen 2 — 6c/12t (BIOS-reported base 2.0 GHz; community docs report higher clocks on some firmware versions)
GPU Cyan Skillfish — "RDNA 1.5" (informal), GFX1013, 24 CUs (1536 SPs), ray tracing capable
Memory 16 GB GDDR6 unified (on-package, 256-bit bus), shared CPU/GPU
VRAM 512 MB BIOS-carved framebuffer (same physical UMA pool — see note below)
GTT 16 GiB (tuned via ttm.pages_limit=4194304, default 7.4 GiB)
Vulkan total 16.5 GiB after tuning
Storage 475 GB NVMe
OS Fedora 43, kernel 6.18.9, headless
TDP 220W board (inference: 130–155W, between jobs: 55–60W, true idle w/o model: ~35W)
BIOS Community-patched (unlocks dynamic VRAM allocation, chipset menus) — AMD BC-250 docs
CPU governor performance (stock schedutil causes LLM latency spikes)

Unified memory is your friend (but needs tuning)

CPU and GPU share the same 16 GB physical pool (UMA — Unified Memory Architecture). The 512 MB "dedicated framebuffer" reported by mem_info_vram_total is carved from the same physical memory — it's a BIOS reservation, not separate silicon. The rest is accessible as GTT (Graphics Translation Table).

UMA reality: On unified memory, "100% GPU offload" means the model weights and KV cache live in GTT-mapped pages that the GPU accesses directly — there's no PCIe copy. However, it's still the same physical RAM the CPU uses. "Fallback to CPU" on UMA isn't catastrophic like on discrete GPUs (no bus transfer penalty), but GPU ALUs are faster than CPU ALUs for matrix ops.

Two bottlenecks had to be fixed in this setup:

  1. GTT capamdgpu driver defaults to 50% of RAM (~7.4 GiB). The legacy fix was amdgpu.gttsize=14336 in kernel cmdline, but this parameter is now deprecated in favor of ttm.pages_limit (kernel TTM docs, Jeff Geerling's notes).
  2. TTM pages_limit — kernel TTM memory manager independently caps allocations at ~7.4 GiB. Fix: ttm.pages_limit=4194304 (16 GiB in 4K pages). On this Fedora 43 / kernel 6.18.9 stack, this is the only tuning needed. Other kernels or distros may behave differently.

GTT migration complete: amdgpu.gttsize is deprecated and was removed from this setup's kernel cmdline. With ttm.pages_limit=4194304 alone, GTT grew from 14→16 GiB and Vulkan available from 14.0→16.5 GiB. The deprecated parameter was actually limiting the allocation.

After tuning: Vulkan sees 16.5 GiB — enough for the 35B MoE primary at 32K practical filled context, or 14B dense models at up to 64K filled context (Q4_0 KV), with all tested inference running on GPU. The 64K allocation default remains — most chats use only a fraction of the context window.


2. Driver & Compute Stack

The BC-250's GFX1013 falls between supported driver tiers. BC-250/Cyan Skillfish support in Mesa/RADV has been evolving rapidly (Phoronix coverage, Mesa RADV docs) — the status below reflects this specific setup and may change with newer Mesa versions.

Layer Status Notes
amdgpu kernel driver Auto-detected, firmware loaded
Vulkan (RADV/Mesa) Mesa 25.3.4, Vulkan 1.4.328
ROCm / HIP rocblas_abort() — GFX1013 not in GPU list
OpenCL (rusticl) ⚠️ Not usable in this setup (Mesa 25.3.4 / Fedora 43). Community reports suggest evolving support.

Why ROCm fails: GFX1013 is listed in LLVM as supporting rocm-amdhsa, but AMD's ROCm userspace (rocBLAS/Tensile) doesn't ship GFX1013 solution libraries. On this Fedora 43 / Mesa 25.3.4 deployment, Vulkan was the only GPU compute path that proved usable as of early 2026. OpenCL/rusticl may work in other Mesa versions or configurations.

▸ What about HSA_OVERRIDE_GFX_VERSION?

A common suggestion for unsupported AMD GPUs is to set HSA_OVERRIDE_GFX_VERSION=10.3.0 to masquerade as gfx1030. This is not advisable for GFX1013: the BC-250 is GFX10.1-era ISA, while gfx1030 is GFX10.3 — the instruction set differences risk silent compute errors or crashes. Additionally, ROCm on AMD APUs (unified memory) lacks the Vulkan shader cache advantage: on APU hardware, the Vulkan backend in llama.cpp is typically faster on cold start and comparable on warm runs compared to ROCm, because Vulkan caches compiled shaders to disk while ROCm recompiles every launch. Since Vulkan already works and ROCm would require installing unsupported packages on Fedora 43, this path was not pursued.

▸ Verification commands
vulkaninfo --summary
# → GPU0: AMD BC-250 (RADV GFX1013), Vulkan 1.4.328, INTEGRATED_GPU

cat /sys/class/drm/card1/device/mem_info_vram_total   # → 536870912 (512 MB)
cat /sys/class/drm/card1/device/mem_info_gtt_total    # → 17179869184 (16 GiB, after TTM tuning — see §3.3)

3. Ollama + Vulkan Setup

3.1 Install and enable Vulkan

curl -fsSL https://ollama.com/install.sh | sh

# Enable Vulkan backend for this deployment via OLLAMA_VULKAN=1
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment=OLLAMA_VULKAN=1
Environment=OLLAMA_KEEP_ALIVE=30m
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_GPU_OVERHEAD=0
Environment=OLLAMA_CONTEXT_LENGTH=65536
Environment=OLLAMA_MAX_QUEUE=4
OOMScoreAdjust=-1000
Environment=OLLAMA_KV_CACHE_TYPE=q4_0
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

OOMScoreAdjust=-1000 protects Ollama from the OOM killer — keeping the model process alive is the priority on a memory-constrained system (see §3.4).

On this deployment, ROCm initialization failed during Ollama startup; the runtime continued with Vulkan.

▸ Why Ollama instead of building llama.cpp directly?

A common question: "Why not build llama.cpp locally with -march=native instead of using Ollama?"

Short answer: Ollama already uses AVX2 (via its bundled libggml-cpu-haswell.so), and the CPU target is irrelevant anyway — all matrix ops run on the Vulkan GPU. Verified on this hardware:

Configuration qwen3:4b gen tok/s MoE 35B-A3B gen tok/s
Ollama 0.18 (Vulkan, 65K ctx, q4_0 KV, FA) ~74 ~37.3
llama-server (HEAD, same settings) 80.7 (+9%) 65.9 (+77%)
llama-bench native, q4_0 KV, FA 86.2 (small ctx) 79.2 (small ctx)
llama-bench haswell, q4_0 KV, FA 86.1 (small ctx)
llama-bench CPU-only (no GPU) 14.8

Reproduced 3× (llama.cpp commits 41361c8, 6307ec0). Ollama numbers from Ollama's own eval_duration timing. llama-server numbers from wall-clock over 5 runs (excluding warmup run 1).

Key findings:

  • -march=native vs -march=haswell: 0.1% difference — negligible, confirms CPU SIMD target is irrelevant when inference runs on Vulkan GPU.
  • llama.cpp HEAD vs Ollama 0.18: +9% dense, +77% Qwen MoE — from improved Vulkan shaders upstream. The Qwen MoE gain is especially large, suggesting significant shader optimization for sparse architectures. Ollama will inherit these gains in the next release.
  • No swap thrashing — llama-server with Qwen MoE 30B-A3B at 65K context used 12 GiB RAM with 1.7 GiB swap (same as baseline). The earlier "swap thrashing" finding from round 1 was due to running without flash attention and q4_0 KV.
  • Practical value: Ollama manages model loading/unloading, HTTP API, systemd integration, and KV cache lifecycle. Replacing it with raw llama-server would require reimplementing all of that, for speed gains that upstream will deliver anyway.

The llama-bench numbers that look even faster (79–89 tok/s) use small default context allocation (~640 tokens vs Ollama's 65K). Larger context = more KV cache memory = less bandwidth available for generation.

3.2 Tune GTT size

No longer needed on this setup. The deprecated amdgpu.gttsize parameter was removed from our kernel cmdline. With ttm.pages_limit=4194304 alone, GTT allocates 16 GiB (more than the old 14 GiB). Verify:

cat /sys/class/drm/card1/device/mem_info_gtt_total  # → 17179869184 (16 GiB)
# If you still have amdgpu.gttsize in cmdline, remove it:
sudo grubby --update-kernel=ALL --remove-args="amdgpu.gttsize=14336"

3.3 Tune TTM pages_limit ← unlocks 14B models

In this setup, this was the key fix. Without it, 14B models loaded fine but produced HTTP 500 during inference.

# Runtime (immediate)
echo 4194304 | sudo tee /sys/module/ttm/parameters/pages_limit
echo 4194304 | sudo tee /sys/module/ttm/parameters/page_pool_size

# Persistent
echo "options ttm pages_limit=4194304 page_pool_size=4194304" | \
  sudo tee /etc/modprobe.d/ttm-gpu-memory.conf
printf "w /sys/module/ttm/parameters/pages_limit - - - - 4194304\n\
w /sys/module/ttm/parameters/page_pool_size - - - - 4194304\n" | \
  sudo tee /etc/tmpfiles.d/gpu-ttm-memory.conf
sudo dracut -f

3.4 Context window & KV cache — the main gotcha

During inference, the model maintains a KV (Key-Value) cache — a per-token scratch buffer that grows linearly with context length. On this UMA system where CPU and GPU share the same 16 GB, KV cache competes directly with model weights for memory. Ollama allocates KV cache based on the model's declared context window. Without a cap, large models request more KV cache than the BC-250 can handle, causing TTM fragmentation, OOM kills, or deadlocks.

Fix: Set OLLAMA_CONTEXT_LENGTH=65536 in the Ollama systemd override (see §3.1). This caps the default allocation at 64K — the verified ceiling where all models can actually process a full context within acceptable time.

Critical companion fix: Set OLLAMA_KV_CACHE_TYPE=q4_0. This quantizes the KV cache to 4-bit, reducing KV memory by ~4× compared to FP16. On this hardware, this single setting raises the context ceiling from 16–64K (FP16) to much larger allocations — but see the important distinction between allocation and filled context in the extended benchmark (§4.5).

# In /etc/systemd/system/ollama.service.d/override.conf:
Environment=OLLAMA_KV_CACHE_TYPE=q4_0
Environment=OLLAMA_CONTEXT_LENGTH=65536

How we got to 65536: Started with FP16 KV at 40K context — caused TTM deadlocks. Dropped to 24K (sweet spot for FP16 on 14B models). Switching to Q4_0 KV unlocked 128K+ allocation for all models, but extended benchmarking (§4.5) showed 128K filled context times out (TTFT >20 min). The practical filled ceiling is 96K for the MoE, qwen3.5:9b, and phi4-mini; most dense 8–14B models top out at 64K filled. 64K is the safe universal default where all models can process a full context. Higher contexts still work for short prompts (chat) where only a fraction of the window is filled.

3.5 Swap — NVMe-backed safety net

With the model consuming 11+ GB on a 16 GB system, in this setup disk swap was required for surviving inference peaks.

NVMe wear concern: Swap is a safety net, not an active paging target. In steady state, swap usage is ~400 MB (OS buffers pushed out to make room for model weights). SMART data after months of 24/7 operation: 3% wear, 25.4 TB total written. In steady state, the model runs in RAM — swap catches transient spikes during model load/unload transitions. Consumer NVMe drives rated for 300–600 TBW should last years at this write rate.

# Create 16 GB swap file (btrfs requires dd, not fallocate)
sudo dd if=/dev/zero of=/swapfile bs=1M count=16384 status=progress
sudo chattr +C /swapfile   # disable btrfs copy-on-write
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon -p 10 /swapfile

# Make permanent
echo '/swapfile none swap sw,pri=10 0 0' | sudo tee -a /etc/fstab

Disable/reduce zram — zram compresses pages in physical RAM, competing with the model:

sudo mkdir -p /etc/systemd/zram-generator.conf.d
echo -e '[zram0]\nzram-size = 2048' | sudo tee /etc/systemd/zram-generator.conf.d/small.conf
# Or disable entirely: zram-size = 0

3.6 Verify

sudo journalctl -u ollama -n 20 | grep total
# → total="12.3 GiB" available="12.3 GiB"  (GPU detection at startup, before model loading)
free -h
# → Swap: 15Gi total, ~1.4Gi used

3.7 Disable GUI (saves ~1 GB)

sudo systemctl set-default multi-user.target && sudo reboot

3.8 CPU governor — lock to performance

The stock schedutil governor down-clocks during idle, causing observable latency spikes at inference start on this setup. Lock all cores to full speed:

# Runtime (immediate)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Persistent (systemd-tmpfiles)
echo 'w /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor - - - - performance' | \
  sudo tee /etc/tmpfiles.d/cpu-governor.conf

Memory layout after tuning

16 GB Unified Memory

Region Size Notes
VRAM carveout 512 MB BIOS-reserved from UMA pool (not separate memory)
GTT 16 GiB Tuned via ttm.pages_limit=4194304 (default 7.4 GiB). Deprecated amdgpu.gttsize removed from this setup.
TTM pages_limit 16 GiB ttm.pages_limit=4194304 — the only memory tuning parameter needed in this setup
Vulkan heap Size
Device-local 8.33 GiB
Host-visible 8.17 GiB
Total 16.5 GiB → 14B models fit, all tested inference on GPU (UMA — same physical pool)

UMA heap note: On this unified memory system, Vulkan reports multiple heaps totaling ~16.5 GiB, but these are overlapping logical views backed by the same 16 GB physical memory pool. They should not be interpreted as additive hardware capacity.

Consumer Usage Notes
Model weights (qwen3.5-35b-a3b-iq2m) 10.3 GiB GPU + 0.3 GiB CPU UD-IQ2_M, 41/41 layers on Vulkan at 4K ctx (spills at higher ctx — see §4.6)
KV cache (Q4_0 @ 4K) ~0.4 GiB Q4_0 KV: ~4× smaller than FP16. Grows ~0.1 GiB per 1K tokens
Compute graph ~0.2 GiB GPU-side
signal-cli + queue-runner ~1.0 GiB System RAM
OS + services ~0.9 GiB Headless Fedora 43
NVMe swap 16 GiB (374 MB used) Safety net
zram 0 B (allocated, not active) Device exists but disksize=0
Total loaded ~12.3 GiB (@4K) / ~12.5 GiB (@16K) ~4.0–4.2 GiB free

4. Models & Benchmarks

Benchmark methodology: All benchmarks below were run on a single BC-250 board (Fedora 43, kernel 6.18.9, Mesa 25.3.4 RADV, Ollama 0.20.0) with Q4_0 KV cache (KV cache quantized to 4-bit — see §4.4). Six measurement phases: performance baseline (33 models, single run), statistical validation (8 models, 3 runs each, CV <1.5%), filled-context scaling (32 attempted, 30 produced usable data), quality assessment (all models tested), cold-start TTFT (2 production models), and prefill scaling (4 prompt sizes per model). All performance results use a standardized ~88-token prompt with num_predict=100 (generate 100 output tokens). Prefill column reflects warm-model prompt processing speed.

Allocation vs Filled context: Earlier benchmarks tested context ceilings with tiny prompts and large num_ctx — this only measures KV cache allocation, not actual utilization. The filled-context benchmark (§4.5) fills 80% of num_ctx with real tokens and verifies prompt_eval_count to detect silent truncation. This revealed that Ollama silently caps some models to their native limit and that filled-context TTFT exceeds 20 minutes at 128K for every model tested. The "Filled Ctx" column below reflects the verified ceiling. The "Alloc Ctx" column shows the allocation ceiling (useful for chat where only a fraction of context is filled).

4.1 Compatibility table

Ollama 0.20.0 · Vulkan · RADV Mesa 25.3.4 · 16.5 GiB Vulkan · Q4_0 KV cache

Column guide: Params = total parameter count (35B/3B = 35B total, 3B active for MoE). Quant = weight quantization format. tok/s = output generation speed. Prefill = prompt processing speed (tok/s). Alloc Ctx = max context that allocates successfully. Filled Ctx = max context verified with 80% real tokens. VRAM @4K = GPU memory at 4K context.

Model Params Quant tok/s Prefill Alloc Ctx¹ Filled Ctx² VRAM @4K Status
qwen3.5-35b-a3b-iq2m 35B/3B UD-IQ2_M 38 123 64K⁷ 32K⁶ 12.3 GiB 🏆 Fallback >40K — MoE
qwen3.5:9b 9.7B Q4_K_M 32 146 128K 96K⁵ 7.9 GiB 🏆 Best context+vision
llama3.2:3b 3.2B Q4_K_M 104 10479 128K 64K 2.2 GiB ✅ Fastest tested
qwen2.5:3b 3.1B Q4_K_M 102 10738 128K 32K³ 2.1 GiB ⚠️ Truncated above 32K
phi4-mini 3.8B Q4_K_M 88 6968 128K 96K⁵ 2.5 GiB ✅ Fast + lightweight
gemma3:4b 4B Q4_K_M 76 3781 128K 3.8 GiB ✅ Multimodal
qwen3:4b 4B Q4_K_M 74 290 128K 2.9 GiB ✅ Thinking mode
Qwen3-Coder-30B-A3B 30.5B/3.3B UD-IQ2_M 62 2982 64K⁷ 64K 10.3 GiB ✅ Code-focused MoE
Qwen3-30B-A3B (Q2_K) 30.5B/3B Q2_K 60 2632 256K 64K 10.7 GiB ✅ MoE, heavy quant
qwen2.5:7b 7.6B Q4_K_M 55 5830 128K 32K 4.4 GiB ⚠️ 72% load failure rate
qwen2.5-coder:7b 7.6B Q4_K_M 55 5826 128K 4.4 GiB ✅ Code-focused
llama3.1:8b 8.0B Q4_K_M 51 174 128K 4.7 GiB ✅ Alloc tested
huihui_ai/seed-coder-abliterate 8.3B Q4_K_M 51 196 128K 4.8 GiB ✅ Code gen, uncensored
mannix/llama3.1-8b-lexi 8.0B Q4_0 50 959 128K 4.5 GiB ✅ Uncensored 8B
granite3.3:8b 8B Q4_K_M 46 5762 128K 4.9 GiB ✅ IBM Granite
qwen3-abl-nothink 8.2B Q4_K_M 46 166 128K 4.9 GiB ✅ Abliterated
huihui_ai/qwen3-abliterated:8b 8.2B Q4_K_M 46 166 128K 4.9 GiB ✅ Abliterated 8B
glm4:9b 9B Q4_K_M 45 178 128K 5.1 GiB ✅ GLM-4
qwen3:8b 8.2B Q4_K_M 44 1974 128K 64K 5.1 GiB ✅ Filled 64K verified
qwen3:8b-nothink 8.2B Q4_K_M 43 1985 128K 5.1 GiB
deepseek-r1:8b 8B Q4_K_M 43 1824 128K 5.1 GiB ✅ Reasoning
gemma2:9b 9.2B Q4_0 39 3346 128K 8K³ 6.9 GiB ⚠️ Truncated above 8K
mistral-nemo:12b 12.2B Q4_0 34 140 128K 64K 6.7 GiB ✅ Filled 64K verified
gemma4 27B MoE Q4_0 33 252 256K 3.0 GiB ✅ MoE, 31% GPU⁸
gemma4-26b-q3 26B MoE UD-Q3_K_M 39 1238 48K 48K 13.5 GiB 🏆 Primary chat, 100% GPU⁹
qwen3:8b-q8_0 8.2B Q8_0 31 196 128K 8.5 GiB ✅ Quality Q8
gemma3:12b 12B Q4_K_M 29 112 128K 8.7 GiB ✅ Multimodal 12B
deepseek-r1:14b 14B Q4_K_M 29 2298 128K 32K 8.5 GiB ✅ Filled 32K verified
phi4:14b 14.7B Q4_K_M 29 92 128K 16K³ 8.5 GiB ⚠️ Truncated above 16K
qwen3-14b-16k 14.8B Q4_K_M 27 91 128K 8.7 GiB ✅ Alloc tested
huihui_ai/qwen3-abliterated:14b 14.8B Q4_K_M 27 91 128K 8.7 GiB ✅ Alloc tested
qwen3:14b 14.8B Q4_K_M 27 91 128K 64K 8.9 GiB ✅ Filled 64K verified (R1)
qwen3.5-27b-iq2m 26.9B IQ2_M ❌ Timed out on 0.20⁴

¹ Alloc Ctx = maximum context where KV cache allocation succeeds (tiny prompt, large num_ctx). This is what the previous benchmark measured. Useful for chat with short prompts.

² Filled Ctx = maximum context verified with context actually filled to 80% with real tokens (extended benchmark). Timeout at 20 min per test. "—" = not yet tested with filled context. See §4.5 for full results.

96K TTFT caveat: MoE, qwen3.5:9b, and phi4-mini produce output at 96K filled (18.9, 19.6, 13.2 tok/s respectively), but TTFT exceeds 20 minutes — impractical for interactive use. The production ceiling is 64K (OLLAMA_CONTEXT_LENGTH=65536). See B10 for practical recommendations.

³ Silent truncation: Ollama silently caps these models to their native context limit without any error. The allocation test always passes, but prompt_eval_count reveals the model only processes tokens up to its native limit. qwen2.5:3b → 32K native, phi4:14b → 16K native, gemma2:9b → 8K native.

MoE 64K regression (qwen3.5-35b-a3b-iq2m): Ran at 22.9 tok/s in an initial test, but only 0.7 tok/s on a later isolated retest (same config). Likely caused by UMA memory fragmentation after extended uptime. 32K is the practical ceiling — stable at 28.5 tok/s across all rounds. See B5.2 for full analysis.

All models except gemma4 e4b run fully on GPU (100% offload) after GTT tuning (16 GiB). The qwen3.5-35b-a3b-iq2m fallback spills ~0.3 GiB of embeddings to CPU, which has negligible impact on UMA. Gemma 4 e4b runs at 31% GPU offload (see ⁸). The custom Gemma 4 26B MoE (UD-Q3_K_M, 13.5 GiB) runs at 100% GPU — the largest fully-offloaded model on BC-250 (see ⁹).

qwen3.5-27b-iq2m regression: Previously functional at 10.5 tok/s on Ollama 0.18. Times out at 4K context on Ollama 0.20.0 — appears to be a regression in the new Vulkan backend for this heavily quantized dense 27B model.

Alloc Ctx regression on Ollama 0.20: qwen3.5-35b-a3b-iq2m and Qwen3-Coder-30B-A3B dropped from 256K → 64K allocation ceiling after the 0.18 → 0.20 upgrade. These are heavily quantized MoE models. The filled-context ceiling (32K and 64K respectively) remains within the new allocation limit.

Gemma 4 e4b partial GPU offload: Only 31% of model weights are offloaded to GPU (3.0 GiB of ~9.7 GiB). Possibly a Vulkan backend limitation for the Gemma 4 e4b architecture on GFX1013. Despite partial GPU offload, generation speed (33 tok/s) is competitive with fully-offloaded 9B models, suggesting the active MoE experts fit in VRAM.

Gemma 4 26B MoE A4B (custom GGUF): The 26B A4B variant (128 experts, 8 active + 1 shared, 3.8B active params) runs at 39.0 tok/s with 100% GPU offload (13.5 GiB) using Unsloth UD-Q3_K_M quantization (11.6 GiB file). This is the largest model successfully run fully on GPU on BC-250. Ollama's official Q4_K_M (18 GB) exceeds 16 GB UMA and crashes the system. Context ceiling is 48K (49152 verified at 33.7 tok/s; 65K times out after 5 min, exhausts RAM + swap). Prefill reaches 1238 tok/s at 4K context.

IQ2_M basic functionality confirmed: Quality benchmarks (5 tasks × 3 runs) confirmed that the 35B MoE scored 14/15 (93%) on summarization, JSON extraction, fact recall, instruction following, and arithmetic — while the 9B Q4_K_M fallback scored 15/15 (100%). The extreme quantization (~2.5 bits per parameter) doesn't break basic functionality on these tasks. However, the benchmark tasks are simple enough that even 3B models score 93% — they do not measure nuance, reasoning depth, or generation quality where larger models are expected to have an advantage. Complex mathematical reasoning and multi-step logic were not tested. See §4.5a for details.

4.2 Benchmark visualization

Generation speed (tok/s) — higher is better (Q4_0 KV, all GPU):

Model                          tok/s   Max Ctx   ██ = 10 tok/s
──────────────────────────────────────────────────────────────────
llama3.2:3b                      104     128K  ██████████▍
qwen2.5:3b                       102     128K  ██████████▏
phi4-mini                         88     128K  ████████▊
gemma3:4b                         76     128K  ███████▋
qwen3:4b                          74     128K  ███████▍
Qwen3-Coder-30B-A3B               62      64K  ██████▏ ← code MoE
Qwen3-30B-A3B (Q2_K)              60     256K  ██████
qwen2.5:7b                        55²    128K  █████▌  ← 72% load failure
qwen2.5-coder:7b                  55     128K  █████▌
llama3.1:8b                       51     128K  █████▏
seed-coder-abl:8b                 51     128K  █████
lexi-8b (uncensored)              50     128K  █████
granite3.3:8b                     46     128K  ████▋
qwen3-abl:8b                      46     128K  ████▋
glm4:9b                           45     128K  ████▌
qwen3:8b                          44     128K  ████▍
deepseek-r1:8b                    43     128K  ████▎
gemma2:9b                         39     128K  ███▉
★ gemma4-26b-q3 (26B MoE)          39      48K  ███▉  ← PRIMARY chat, 100% GPU⁹
★ qwen3.5-35b-a3b-iq2m            38      64K  ███▊  ← FALLBACK >40K (35B/3B)
mistral-nemo:12b                  34     128K  ███▍
gemma4 (27B MoE)                  33     256K  ███▎  ← e4b, 31% GPU⁸
★ qwen3.5:9b                      32     128K  ███▏  ← best ctx + vision
qwen3:8b-q8_0                     31     128K  ███▏  ← quality Q8
gemma3:12b                        29     128K  ██▉
deepseek-r1:14b                   29     128K  ██▉
phi4:14b                          29     128K  ██▉
qwen3-abl:14b                     27     128K  ██▋
qwen3:14b                         27     128K  ██▋

² qwen2.5:7b speed from successful runs only (72% intermittent load failure; see B4).

Context ceiling per model (Q4_0 KV, all GPU):

Model                      4K  8K  16K  32K  64K  128K  256K
─────────────────────────────────────────────────────────────
qwen2.5:3b                 ✅  ✅   ✅   ✅   ✅    ✅    —
llama3.2:3b                ✅  ✅   ✅   ✅   ✅    ✅    —
phi4-mini                  ✅  ✅   ✅   ✅   ✅    ✅    —
gemma3:4b                  ✅  ✅   ✅   ✅   ✅    ✅    —
qwen3:4b                   ✅  ✅   ✅   ✅   ✅    ✅    —
qwen2.5:7b                 ✅  ✅   ✅   ✅   ✅    ✅    —   ⚠️ 72% load failure
qwen2.5-coder:7b           ✅  ✅   ✅   ✅   ✅    ✅    —
qwen3:8b                   ✅  ✅   ✅   ✅   ✅    ✅    —
qwen3:8b-q8_0              ✅  ✅   ✅   ✅   ✅    ✅    —
qwen3-abl:8b               ✅  ✅   ✅   ✅   ✅    ✅    —
deepseek-r1:8b             ✅  ✅   ✅   ✅   ✅    ✅    —
seed-coder:8b              ✅  ✅   ✅   ✅   ✅    ✅    —
llama3.1:8b                ✅  ✅   ✅   ✅   ✅    ✅    —
lexi-8b                    ✅  ✅   ✅   ✅   ✅    ✅    —
granite3.3:8b              ✅  ✅   ✅   ✅   ✅    ✅    —
glm4:9b                    ✅  ✅   ✅   ✅   ✅    ✅    —
gemma2:9b                  ✅  ✅   ✅   ✅   ✅    ✅    —
★ qwen3.5:9b               ✅  ✅   ✅   ✅   ✅    ✅    —
gemma3:12b                 ✅  ✅   ✅   ✅   ✅    ✅    —
mistral-nemo:12b           ✅  ✅   ✅   ✅   ✅    ✅    —
qwen3:14b                  ✅  ✅   ✅   ✅   ✅    ✅    —
phi4:14b                   ✅  ✅   ✅   ✅   ✅    ✅    —
deepseek-r1:14b            ✅  ✅   ✅   ✅   ✅    ✅    —
★ MoE 35B-A3B              ✅  —    ✅   ✅   ✅    ❌    —   ⁷ was 256K on 0.18
Qwen3-Coder-30B-A3B        ✅  —    ✅   ✅   ✅    ❌    —   ⁷ was 256K on 0.18
Qwen3-30B-A3B (Q2_K)       ✅  —    ✅   ✅   ✅    ✅    ✅
gemma4 (27B MoE)           ✅  —    ✅   ✅   ✅    ✅    ✅
★ gemma4-26b-q3 (26B MoE)  ✅  ✅   ✅   ✅   ❌    —    —   ⁹ 48K max, 65K=FAIL

✅ = works 100% GPU | ❌ = timeout/fail | — = not tested

Every dense model tested allocates 128K. Qwen3-30B-A3B (Q2_K) and gemma4 e4b allocate 256K. The gemma4-26b-q3 (UD-Q3_K_M, 13.5 GiB) reaches 48K but times out at 65K — the large model weight leaves limited KV cache budget. Two MoE models (35B-A3B, Coder-30B) regressed from 256K → 64K after the Ollama 0.18 → 0.20 upgrade (see ⁷). Filled-context ceilings are lower and shown separately in the tables above.

Graphical benchmarks (see §B for full methodology):

Generation speed

Prefill speed

Generation vs Prefill — all models side by side

Note on the Gemma 4 prefill outlier: Gemma 4 26B reaches ~1238 tok/s prefill — far above any other model on the chart. This likely reflects its unusual MoE design: 128 experts with only 8 active + 1 shared per token (~3.8B active out of 26B total, ~15% activation). During prefill (which tends to be compute-bound), the router may only need to evaluate a small fraction of the model per token, allowing much higher throughput. The other MoE models here — Qwen3-Coder-30B (A14B, ~47% activation) and Qwen3-30B Q2_K (A3B but aggressively quantized to ~8 GiB) — appear to benefit less from sparsity, either because they activate a larger share of experts or because their small memory footprint already reduces the compute advantage. That said, this is a plausible explanation rather than a confirmed one — the actual dispatch behavior depends on the Ollama Vulkan backend, which has not been profiled.

4.3 Context window experiments

Historical note: The experiments below were conducted with FP16 KV cache before Q4_0 KV was deployed. With Q4_0 KV deployed (see §4.4), these memory constraints no longer apply — qwen3:14b now reaches 128K context without deadlocks. This section is preserved to document the FP16 behavior for reference.

The context window directly controls KV cache size, and on 16 GB unified memory, every megabyte counts. After v7 (OpenClaw removal freed ~700 MB, GTT tuned — see §3.3), all context sizes were re-tested systematically:

Context window vs memory (qwen3:14b Q4_K_M, flash attention, 16 GB GTT)

Context RAM Used Free Swap Speed Status
8192 ~9.5 GB 6.5 GB ~27 tok/s ✅ Safe
12288 ~10.3 GB 5.7 GB ~27 tok/s ✅ Conservative
16384 ~11.1 GB 4.9 GB ~27 tok/s ✅ Comfortable
18432 ~13.2 GB 2.7 GB 0.9 GB 26.8 tok/s ✅ Works
20480 ~13.7 GB 2.3 GB 0.9 GB 26.8 tok/s ✅ Works
22528 ~14.0 GB 2.0 GB 0.9 GB 26.7 tok/s ✅ Works
24576 ~14.4 GB 1.5 GB 0.9 GB 26.7 tok/s ✅ Max for qwen3:14b
26624 ~14.6 GB 1.3 GB 1.0 GB 23.9 tok/s ⚠️ 10% slower
28672 ~14.2 GB 1.7 GB timeout ❌ Deadlocks
32768 ~15.7 GB 0.2 GB 2.1 GB timeout ❌ Deadlocks
40960 ~16.0 GB 0 💀 TTM fragmentation¹

24K is the sweet spot — full speed (~27 tok/s), leaves ~1.5 GB for OS/services with stable swap at 0.9 GB. 26K works but inference drops 10% due to swap pressure. 28K+ deadlocks under Vulkan.

¹ Why 40K fails isn't raw OOM. The math: 9.3 GB weights + 2 GB KV cache + 1 GB OS ≈ 12.3 GB < 16 GB available. The failure is consistent with TTM fragmentation — the kernel's TTM memory manager likely can't allocate a contiguous block large enough for the KV cache because physical pages are fragmented across GPU and CPU consumers. This is a UMA-specific problem: on discrete GPUs with dedicated VRAM, fragmentation doesn't cross the PCIe boundary.

History: The original 24K experiment deadlocked because OpenClaw gateway consumed ~700 MB. After v7 removed OpenClaw and bumped GTT to 14 GB, 24K became stable. Flash attention (OLLAMA_FLASH_ATTENTION=1) was required in this configuration — without it, 24K did not fit.

4.4 KV cache quantization — breaking the context ceiling

Just as model weights can be quantized (16-bit → 4-bit) to save memory, the KV cache can be quantized too. The KV cache stores intermediate attention state for every token in the context window — at FP16, this dominates memory usage at large context sizes. Quantizing it to Q4_0 (4-bit) shrinks KV memory ~4× with negligible quality impact on this hardware.

Q4_0 KV cache is now deployed in production. This raised the BC-250 from 16–64K usable context (FP16) to 128K+ allocation for all models.

KV Type Context Ceiling (14B) Context Ceiling (Qwen MoE 35B) KV Size @24K Gen tok/s Notes
FP16 (old default) 24K (40K deadlocked) 16K ~3.8 GiB 27.2 Previous production
Q8_0 64K+ 64K+ ~2.0 GiB 27.3 Conservative
Q4_0 (current) 128K 256K ~1.1 GiB 27.3 ← deployed

Q4_0 KV cache scaling: ~45 MiB per 1K tokens (vs ~400 MiB/1K for FP16). At 128K context, KV cache is ~5.8 GiB — fits alongside 8.9 GiB 14B model weights within the 16.5 GiB Vulkan pool.

Quantization impact test (qwen3:8b):

Model Quant KV Type tok/s Prefill Max Ctx VRAM @4K
Q4_K_M Q4_0 43.2 158 128K 5.1 GiB
Q8_0 Q4_0 30.6 184 128K 8.5 GiB

Q8_0 model weights are 29% slower with 67% more VRAM but higher precision. Both reach 128K context with Q4_0 KV.

Historical: FP16 KV context experiments (qwen3:14b, pre-Q4_0)

These earlier measurements show the FP16 KV limitations that Q4_0 eliminated:

Context KV Type Speed Status
24576 FP16 26.7 tok/s ✅ Max for qwen3:14b
28672 FP16 timeout ❌ Deadlocks
32768 FP16 timeout ❌ Deadlocks
24576 Q4_0 27.3 tok/s
48000 Q4_0 27.3 tok/s
128000 Q4_0 27.3 tok/s

Generation speed degrades with context fill (Q4_0, all layers on GPU):

Tokens in context Gen tok/s Prefill tok/s Notes
~100 (empty) 27.2 58 Headline number
3,300 24.6 113 Typical Signal chat
10,000 20.7 70 Long job output
30,000 13.4 53 Heavy document analysis
40,960 (max fill) ~10* ~42 Theoretical, near KV limit

* Estimated from degradation curve. One test at 41K showed 1.2 tok/s, but that was caused by model partial offload (21/41 layers spilled to CPU), not normal operation.

# Production config (in /etc/systemd/system/ollama.service.d/override.conf):
Environment=OLLAMA_KV_CACHE_TYPE=q4_0
Environment=OLLAMA_CONTEXT_LENGTH=65536
# Default 64K — verified filled-context ceiling (see §4.5)

4.5 Extended context benchmark — filled context verification

Previous context ceiling tests used tiny prompts with large num_ctx — this tests KV cache allocation, not actual utilization. The extended re-benchmark fills context to 80% with real tokens, verifies prompt_eval_count matches expected token count, and monitors system resources.

Methodology:

  • Context filled to 80% of num_ctx with repeated English text blocks (~500 tokens each)
  • Two phases per context size: (1) allocation test (tiny prompt), (2) filled test (80% real tokens)
  • prompt_eval_count verified against expected token count to detect silent truncation
  • System RAM and swap monitored via /proc/meminfo before/after each test
  • Timeout: 20 minutes per request. OLLAMA_CONTEXT_LENGTH set to 524288 (uncapped)
  • Services stopped for clean measurements. Single run per configuration.

Results — generation speed with filled context (tok/s):

Model 4K 8K 16K 32K 64K 96K 128K Notes
MoE 35B-A3B 35.7 34.2 31.9 27.9 22.5 18.9 TIMEOUT Ceiling at 96K filled
qwen3.5:9b 31.2 30.4 29.0 26.6 22.6 19.6 TIMEOUT Ceiling at 96K filled
🏆 gemma4-26b-q3 35.2 33.3 31.1 27.7 TIMEOUT Ceiling at 48K filled (40K=26.4, 48K=25.0)
qwen2.5:3b 93.6 87.9 77.8 62.0 32K³ 32K³ 32K³ Truncated above 32K
phi4-mini 72.5 61.5 46.8 31.1 18.7 13.2 TIMEOUT Ceiling at 96K filled
qwen3:8b 39.1 35.4 29.5 21.6 14.3 TIMEOUT Ceiling at 64K filled
qwen3:14b 24.9 23.4 20.4 15.7 11.0 TIMEOUT Ceiling at 64K filled
phi4:14b 25.7 23.1 19.0 16K³ 16K³ 16K³ 16K³ Truncated above 16K
mistral-nemo:12b 31.2 28.5 24.0 18.1 12.1 TIMEOUT Ceiling at 64K filled

³ Silent truncation: Ollama processes only the model's native context limit worth of tokens, silently discarding the rest. The allocation test always passes.

Key findings:

  1. Silent truncation discovered: Ollama silently caps context to the model's native limit. qwen2.5:3b → 32K, phi4:14b → 16K. No error reported — only prompt_eval_count reveals the cap. The old allocation-only benchmark would never catch this.

  2. 128K fill impossible on this hardware: No model completed 128K filled context within 20 minutes. The Qwen MoE's (qwen3.5-35b-a3b) 96K fill took 581 seconds (9.7 min TTFT), and prefill rate degrades from 234 tok/s (4K) to 105 tok/s (96K). At 128K, estimated TTFT would be ~17-25 minutes.

  3. Speed degrades 37-63% from 4K to 64K filled: Qwen MoE goes from 35.7 → 22.5 tok/s (37% drop). Dense 8B models drop 63%. Within 4K–64K, degradation tracks roughly linear in log(context_length), suggesting memory bandwidth (not quadratic attention compute) is the dominant cost at these scales. The old benchmark masked this by not filling context.

  4. Practical ceiling is 32K-64K for interactive use: At 32K, TTFT is 2-3 minutes (acceptable for batch jobs). At 64K, TTFT is 5-12 minutes. Above 64K, only batch processing (not interactive chat) is practical.

  5. OLLAMA_CONTEXT_LENGTH set to 65536 (64K): This is the verified universal ceiling where all models can process a filled context. Higher values still work for chat with short prompts.

  6. Re-benchmark confirmation: The multi-run re-benchmark reproduced all initial context scaling data within ±1 tok/s. Qwen MoE at 64K filled: 22.9 tok/s (initial: 22.5). qwen3:14b at 32K filled: 16.4 tok/s (initial: 15.7).

4.5a Quality & statistical validation

Follow-up benchmark with repeated measurements and quality assessment.

Statistical validation — 3 runs × 8 models confirms single-run reliability:

Model Gen median Range CV%
llama3.2:3b 102.2 [101.3 – 103.9] 1.3%
phi4-mini 86.1 [85.0 – 87.4] 1.4%
Qwen3-30B-A3B (Q2_K) 58.5 [57.9 – 58.9] 0.9%
qwen3:8b 42.8 [42.8 – 43.0] 0.3%
qwen3.5-35b-a3b-iq2m (MoE) 37.5 [37.3 – 37.6] 0.4%
mistral-nemo:12b 34.0 [33.9 – 34.0] 0.2%
qwen3.5:9b 31.7 [31.7 – 31.9] 0.4%
qwen3:14b 26.6 [26.6 – 26.7] 0.2%

CV <1.5% across all 8 models tested. Single-run measurements are reliable on this thermally steady UMA system.

Quality assessment — 5 tasks × 3 runs, scored by Python script (keyword match, JSON parse, regex):

Task MoE 35B-A3B qwen3.5:9b
Summarization 3/3 ✅ 3/3 ✅
JSON extraction 3/3 ✅ 3/3 ✅
Fact recall 3/3 ✅ 3/3 ✅
Instruction following 2/3 ⚠️ 3/3 ✅
Arithmetic (17 × 23) 3/3 ✅ 3/3 ✅
Total 14/15 (93%) 15/15 (100%)

The MoE's one miss was adding preamble before a numbered list — the list itself was correct. These tasks confirm basic functionality (text manipulation, structured output, factual recall) but are too simple to differentiate model quality — even 3B models score 93%. They do not test reasoning depth, nuance, or generation quality where larger models are expected to have real advantages.

Cold-start TTFT — model fully unloaded → first token:

Model Median Load time
MoE 35B-A3B 18.0s 16.2s (~660 MB/s from NVMe)
qwen3.5:9b 7.0s 5.6s (~1.1 GB/s from NVMe)

With OLLAMA_KEEP_ALIVE=30m, cold start (18.0s) occurs only after 30 minutes of inactivity. Warm TTFT at short prompts: 0.3–1.7s.

4.6 Prefill (prompt evaluation) benchmarks

On UMA, both prefill and generation share memory bandwidth. Prefill is the time the model spends "reading" the prompt before generating the first token.

For embedded engineers: Think of LLM inference as two phases — like a bootloader and a main loop. Prefill is the "bootloader": the model processes the entire input prompt in one burst (parallel, compute-bound — like DMA-ing a firmware image into SRAM). Token generation is the "main loop": the model produces output tokens one at a time, sequentially (memory-bandwidth-bound — like polling a UART at a fixed baud rate). MoE (Mixture of Experts) is like having 35 specialized ISRs but only routing to 3 of them per interrupt — you get the routing intelligence of knowing all 35, but only pay the execution cost of 3. This is the likely reason the 35B MoE measured faster than the 14B dense model on this hardware (see §4.9 for caveats).

Prefill rate vs prompt size — production models (Q4_0 KV cache, warm):

qwen3.5-35b-a3b-iq2m (MoE 35B/3B active, UD-IQ2_M):

Prompt Size Tokens Prefill Gen tok/s TTFT (warm)
Tiny 17 53 tok/s 39.3 0.3s
Short 42 68 tok/s 39.6 0.6s
Medium 384 231 tok/s 38.5 1.7s
Long 1,179 228 tok/s 38.3 5.2s

qwen3.5:9b (Q4_K_M, dense 9.7B):

Prompt Size Tokens Prefill Gen tok/s TTFT (warm)
Tiny 17 61 tok/s 33.2 0.3s
Short 42 118 tok/s 33.0 0.4s
Medium 384 229 tok/s 33.0 1.7s
Long 1,179 225 tok/s 32.5 5.2s

Observations: Both production models converged to ~230 tok/s prefill at medium-to-long prompts in this testing — an observed pattern whose mechanism is unproven (could be Vulkan dispatch overhead, memory controller bandwidth, or another bottleneck; see §4.9). At tiny prompts (<50 tokens), GPU compute overhead dominates and prefill drops to 53–61 tok/s. Generation rate was stable across prompt sizes in this testing: MoE held 38–39 tok/s, 9B held 32–33 tok/s. TTFT scales linearly: at 384 tokens it's ~1.7s, at 1.2K tokens it's ~5.2s. For real-world Signal chat (3K system prompt + conversation), expect TTFT of ~15–20s on cold start, <2s when the model is warm (prompt cached via OLLAMA_KEEP_ALIVE=30m).

Historical: qwen3:14b Q4_K_M (previous primary, 24K context)
Prompt Size Tokens Prefill Gen tok/s TTFT (warm)
Tiny 86 88 tok/s 27.2 ~1s
Short 353 67 tok/s 27.2 ~5s
Medium 1,351 128 tok/s 26.1 ~11s
Long 3,354 113 tok/s 24.6 ~30s
XL 6,686 88 tok/s 22.5 ~76s
Massive 10,014 70 tok/s 20.7 ~143s

Generation rate degrades with context: 27.2 tok/s @small → 20.7 tok/s @10K tokens.

Graphical: prefill rate and generation rate vs prompt size:

Prefill and generation rate vs prompt size

Model Landscape Bubble Chart — generation speed × prefill speed × max context (bubble size = context window, unique color per model). Single-run data; relative positions are representative but absolute numbers may differ slightly from the multi-run re-benchmark (B2–B3).

Model landscape — numbered 3D

Model landscape — bubble chart

4.7 Memory budget

qwen3.5-35b-a3b-iq2m · headless server (from Ollama logs)

Component Qwen MoE @4K ctx Qwen MoE @16K ctx Notes
Model weights (GPU) 10.3 GiB ~8.2 GiB 41/41 layers on Vulkan at 4K; spills to CPU at higher ctx
Model weights (CPU) 0.3 GiB ~0.4 GiB Spilled layers + embeddings
KV cache (GPU) 1.6 GiB ~3.8 GiB Measured from Ollama logs at each ctx size
Compute graph ~0.2 GiB ~0.2 GiB GPU-side
Ollama total 12.3 GiB ~12.5 GiB Ollama dynamically spills weights to make room for KV
OS + services ~0.9 GiB ~0.9 GiB Headless Fedora 43
Free (of 16.5 Vulkan) ~4.2 GiB ~4.0 GiB
NVMe swap 16 GiB Safety net

Qwen MoE memory dynamics: As context grows, Ollama spills weight layers from GPU to CPU to maintain a ~12.5 GiB total. The qwen3.5-35b-a3b's total weight (11 GB GGUF) is larger than qwen3:14b (9.3 GB), but only 3B params activate per token — so non-selected expert layers may reduce the penalty relative to a dense model, though this was not isolated experimentally. At 24K+ context, the KV cache exceeds what can fit alongside the weights, causing OOM or timeout.

4.8 Model recommendations

The primary chat model is gemma4-26b-q3 (Gemma 4 26B MoE, UD-Q3_K_M — 26B total params, ~3.8B active per token, 39 tok/s, 48K verified filled context, 100% quality). This is the largest model that runs at 100% GPU offload on BC-250 (13.5 GiB). It achieves the highest prefill throughput of any fully-offloaded model (1238 tok/s at 4K) — likely because its 128-expert MoE architecture activates only ~15% of weights per token during the compute-bound prefill phase. Filled-context verified at all sizes from 4K to 48K with no truncation: 35.2 tok/s at 4K → 25.0 tok/s at 48K (−29% degradation). Prefill remains strong even at 48K (148 tok/s), though TTFT reaches 3.2 min. Context is limited to 48K (65K times out), and VRAM headroom is tight. Also used by all 25+ automated LLM scripts and scrapers (typical prompt sizes 2–14K tokens, well within the 48K ceiling) — sharing the same model with the chat bot avoids Ollama model-swap overhead.

The fallback model is qwen3.5-35b-a3b-iq2m (MoE — 35B total params, 3B active per token, 37.5 tok/s, 32K practical filled context, 93% quality) — used when prompts exceed 40K tokens, where gemma4's 48K ceiling leaves insufficient headroom. Chosen for the largest knowledge capacity that fits in 16 GB UMA while maintaining practical speed, likely benefiting from only 3B parameters activating per token on this scalar GPU (see §4.9 for caveats). 64K filled context has been achieved (22.9 tok/s) but showed severe regression on a later retest after extended uptime (0.7 tok/s — likely UMA memory fragmentation; see B5.2). For vision and multimodal tasks, qwen3.5:9b (dense 9B, 31.7 tok/s, 64K, 100%) provides native image understanding. For the fastest inference, phi4-mini (dense 3.8B, 86.1 tok/s, 64K, 93%) is the fastest model that passes all basic quality checks.

All tok/s figures are Phase 2 medians (3 runs; B3) except gemma4-26b-q3 (measured via §4.9 Ollama comparison, not in Phase 2). Filled context ceilings are verified with 80% real-token fill and prompt_eval_count truncation detection (B5). Quality scores are 5 tasks × 3 runs, deterministic scoring (B4).

The full recommendation table — including reasoning, batch, speed-critical, and image generation picks — is in B10. Model Recommendations.

Production triple-model config: gemma4-26b-q3 as primary chat model (≤40K tokens, 49152 ctx) with qwen3.5-35b-a3b-iq2m as fallback for prompts >40K tokens (65536 ctx). For vision, switch to qwen3.5:9b. All use OLLAMA_KV_CACHE_TYPE=q4_0.

# Primary chat model (26B MoE, 100% GPU, 48K verified filled ctx) — custom GGUF via Modelfile
ollama create gemma4-26b-q3 -f Modelfile-gemma4-26b-q3

# Fallback model (35B MoE, >40K tokens) — custom GGUF via Modelfile
ollama create qwen3.5-35b-a3b-iq2m -f Modelfile-qwen35-35b-a3b

# Vision model (dense 9B, official Ollama)
ollama pull qwen3.5:9b

Why not a bigger MoE? All 35B params must reside in memory even though only 3B activate per token — the router decides at runtime which expert sub-networks to fire. At IQ2_M (~2.5 bits/param), 35B = 11 GB GGUF. The next MoE up — Qwen3-235B-A22B — would be ~44 GB at IQ2_M (2.7× too large). Going below IQ2_M (e.g. IQ1_S at ~1.5 bits) caused severe quality degradation in testing.

4.9 Benchmark limitations

The benchmark campaign measures this specific BC-250 board under one software stack. The following boundaries apply:

  • Quality coverage is partial. 32 models attempted, 31 produced usable results. qwen2.5:7b scored 20% (corrupted by 72% loading bug; only fact recall passes). qwen3.5-27b-iq2m scored 0% (all 15 tasks timed out). Two models scored low due to think:false not being honored — thinking tokens consumed the output budget.
  • Filled-context coverage is partial. 32 models attempted at 4K–64K with 80% real-token fill. 25 reach 64K, 5 have lower native ceilings (32K or 16K), 1 broken (qwen2.5-coder:7b, pec=0), 1 too large to load (qwen3.5-27b-iq2m). All 32 were tested; not all produced usable data.
  • Long-context quality is limited in scope. Tested on 4 production models at 16K and 32K. Embedded fact retrieval: 24/24 pass. Multi-hop reasoning: 8/32 pass. Long-range synthesis: 31/32 pass. Not tested at 64K, not tested on non-production models.
  • Ollama backend lag. As documented in §4.10, Ollama bundles a llama.cpp Vulkan backend that lags upstream HEAD. Building llama.cpp from HEAD yields +45% generation speed on Qwen MoE at 4K context and +7% on dense models (measured against Ollama 0.18; likely similar on 0.20 since generation speeds are unchanged). The Qwen MoE-specific gap suggests newer Vulkan shader optimizations may not yet be in Ollama's vendored copy. However, Ollama's memory management enables 32K–64K contexts that raw llama.cpp cannot achieve (system crash). All tok/s numbers in this document are Ollama measurements — actual hardware potential is higher at small contexts, but Ollama unlocks large contexts.

4.10 Ollama vs upstream llama.cpp — Vulkan overhead analysis

A controlled comparison was performed between Ollama 0.18.0 and upstream llama.cpp HEAD (commit 6b949d1) built from source on-device with -DGGML_VULKAN=ON. All tests run on fresh reboot with services disabled, caches dropped between tests, OOM protection enabled (oom_score_adj=-1000). Note: this comparison predates the Ollama 0.20 upgrade. The 0.20 re-benchmark (§4.1) shows generation speeds unchanged from 0.18 (±0.1% average), so the overhead gap likely persists. A re-test of llama-bench at 16K context caused an OOM system freeze, consistent with previous observations. Gemma 4 was tested separately with llama.cpp HEAD b7ad48e (Gemma 4 architecture support added in #21406).

Methodology: llama-bench with -r 1 -fa 1 -ctk q4_0 -ctv q4_0 -ngl 99 (single repetition, flash attention, quantized KV cache). Ollama with matching settings: OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_CONTEXT_LENGTH=65536. Same model files (hardlinks to Ollama blobs). Fresh reboot, no other processes running.

Generation speed (tok/s):

Model Context llama.cpp TG Ollama TG Overhead
Qwen3 MoE 30B-A3B IQ2_M (10.1 GB) 4K 84.3 58.3 1.45×
Qwen3 MoE 30B-A3B IQ2_M 16K 84.7 49.0 1.73×
Qwen3 MoE 30B-A3B IQ2_M 32K ☠ crash 39.2
Qwen3 MoE 30B-A3B IQ2_M 64K ☠ crash 28.7
DeepSeek-R1 14B Q4_K_M (8.4 GB) 4K 29.0 27.2 1.07×
DeepSeek-R1 14B 16K 29.0
DeepSeek-R1 14B 32K ☠ crash 20.9
Gemma 4 26B Q3_K_M (11.6 GiB) 4K 0.07† 39.0 0.002×

† Vulkan GPU offload appears non-functional — llama.cpp produces 0.07 tok/s TG vs 11.4 tok/s on CPU-only (ngl=0). See finding 6 below.

Prompt processing speed (tok/s):

Model Context llama.cpp PP Ollama PP Ratio
Qwen3 MoE 30B-A3B IQ2_M 4K 285.0 316.4 0.90×
Qwen3 MoE 30B-A3B IQ2_M 16K 152.8 228.1 0.67×
Qwen3 MoE 30B-A3B IQ2_M 32K ☠ crash 157.4
Qwen3 MoE 30B-A3B IQ2_M 64K ☠ crash 96.8
DeepSeek-R1 14B 4K 133.9 127.5 1.05×
DeepSeek-R1 14B 16K 82.6
DeepSeek-R1 14B 32K ☠ crash 83.2
Gemma 4 26B Q3_K_M 4K 1.15† 1238 0.001×

Key findings:

llama.cpp vs Ollama — generation speed

  1. Generation overhead is model-dependent. Qwen MoE models show 1.45–1.73× overhead (growing with context), while dense models show only 1.07×. The wide gap for MoE vs near-parity for dense suggests the overhead may be in Vulkan sparse-layer dispatch — possibly because Ollama's vendored shaders lag upstream by a couple of weeks and miss recent MoE-specific optimizations.

  2. Prompt processing: Ollama is sometimes faster. At 4K and 16K context, Ollama's PP is 7–32% faster than raw llama.cpp for the Qwen MoE. This suggests Ollama uses different batch scheduling or prompt caching that benefits prefill throughput.

  3. 32K+ context: Ollama is essential. Raw llama.cpp crashes the system at 32K for both models (10.1 GB and 8.4 GB). The 16 GB UMA cannot fit model weights + large KV cache without Ollama's memory management. Ollama successfully runs all context sizes up to 64K, making 32K–64K contexts only possible through Ollama on this hardware.

  4. llama-bench TG is context-invariant. At 4K and 16K, llama.cpp generation speed is nearly identical for both models (Qwen MoE: 84.3 vs 84.7; DeepSeek: 29.0 vs 29.0 tok/s). Ollama shows degradation (Qwen MoE: 58.3 → 49.0), likely because it pre-allocates the full context window (up to 65K slots) regardless of actual usage, which on a 16 GB UMA system would cost memory bandwidth even at small context sizes.

  5. Fresh-reboot 64K regression resolved. Fresh-reboot Ollama achieves 28.7 tok/s at 64K (vs 0.7 tok/s on stale system with memory fragmentation). This strongly suggests the regression documented in §4.8 is caused by UMA memory fragmentation rather than a software bug, though the exact mechanism has not been confirmed at the driver level.

  6. Gemma 4 MoE: llama.cpp Vulkan appears non-functional. With ngl=99, llama-bench produces 0.07 tok/s TG and 1.15 tok/s PP — orders of magnitude slower than CPU-only fallback (11.4 TG, 214.6 PP at pp512). Meanwhile Ollama achieves 39 tok/s TG and 1238 tok/s PP using its own Vulkan dispatch. This suggests Ollama's vendored Vulkan shaders may handle Gemma 4's MoE dispatch differently than upstream llama.cpp on GFX1013. In practice, Gemma 4 is only usable through Ollama on this hardware — though this may change as upstream Vulkan support for the architecture matures.

All results reproduced across 3 fresh boots. TG is highly stable (<1% variance), PP shows up to 12% boot-to-boot variance (likely Vulkan shader compilation timing). System crashes at 32K are deterministic for both models — raw llama.cpp has no memory budget awareness for UMA constraints. The TG gap is wider at small contexts because llama-bench allocates only what's needed, while Ollama appears to pre-allocate the full context window — at larger contexts the overhead converges as both backends pay similar memory bandwidth costs.


PART II — AI Stack

5. Signal Chat Bot

The BC-250 runs a personal AI assistant accessible via Signal messenger — no LLM gateway, no agent framework. signal-cli runs as a standalone systemd service exposing a JSON-RPC API, and queue-runner handles all LLM interaction directly.

  Signal --> signal-cli (JSON-RPC :8080) --> queue-runner --> Ollama --> GPU (Vulkan)

Software: signal-cli v0.13.24 (native binary) · Ollama 0.18+ · queue-runner v7

5.1 Why not OpenClaw

OpenClaw was the original gateway (v2026.2.26, Node.js). It was replaced because:

Problem Impact
~700 MB RSS On a 16 GB system, that's 4.4% of RAM consumed by a routing layer
15+ second overhead per job Agent turn setup, tool resolution, system prompt injection — for every cron job
Unreliable model routing Fallback chains and timeout cascades caused 5-min "fetch failed" errors
No subprocess support Couldn't run Python/bash scripts directly — had to shell out through the agent
9.6K system prompt Couldn't be trimmed below ~4K tokens without breaking tool dispatch
Orphan processes signal-cli children survived gateway OOM kills, holding port 8080

The replacement: queue-runner talks to signal-cli and Ollama directly via HTTP APIs. No agent framework in between.

See Appendix A for the original OpenClaw configuration.

5.2 signal-cli service

signal-cli runs as a standalone systemd daemon with JSON-RPC (signal-cli manpage). The port, flags, and systemd unit configuration below are local implementation choices — the JSON-RPC API is an upstream feature, but the specific service layout is custom:

# /etc/systemd/system/signal-cli.service
[Unit]
Description=signal-cli JSON-RPC daemon
After=network.target

[Service]
Type=simple
ExecStart=/opt/signal-cli/bin/signal-cli --output=json \
  -u +<BOT_PHONE> jsonRpc --socket http://127.0.0.1:8080
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Register a separate phone number for the bot via signal-cli register or signal-cli link.

5.3 Chat architecture

Between every queued job, queue-runner.py polls the signal-cli journal for incoming messages. Messages are routed based on content type:

queue-runner v7 — continuous loop

  job N  →  check Signal inbox  →  route message  →  job N+1
                    |                     |
                    v                     |
            journalctl -u          ┌──────┼──────┐
            signal-cli             │      │      │
                                audio  image   text
                                   │      │      │
                                   v      v      v
                              whisper  qwen3.5  choose_chat_model()
                              -cli     :9b       │
                              (Vulkan) vision    ├─ ≤40K → gemma4-26b-q3
                                   │      │     └─ >40K  → MoE 35B fallback
                                   │      │      │
                                   v      v      v
                              signal-cli: send reply

Signal Pipeline

Key parameters:

Setting Value Purpose
SIGNAL_CHAT_MODEL gemma4-26b-q3 Primary chat model (26B MoE, 100% GPU, 100% quality)
SIGNAL_CHAT_CTX 49152 Gemma 4 context ceiling (48K verified filled)
SIGNAL_FALLBACK_MODEL qwen3.5-35b-a3b-iq2m Qwen MoE fallback for prompts >40K tokens
SIGNAL_FALLBACK_CTX 65536 Fallback context (64K)
VISION_MODEL qwen3.5:9b Vision analysis model (multimodal)
VISION_CTX 65536 Vision context — matches global 64K ceiling
ROUTING_GEMMA4_LIMIT 40000 Switch to qwen3.5-35b-a3b-iq2m fallback above this token count
SIGNAL_CHAT_MAX_EXEC 3 Max shell commands per message
SIGNAL_EXEC_TIMEOUT_S 30 Per-command timeout
SIGNAL_MAX_REPLY 1800 Signal message character limit

5.4 Tool use — EXEC

The LLM can request shell commands via EXEC(command) in its response. queue-runner intercepts these, runs them, feeds stdout back into the conversation, and lets the LLM synthesize a final answer:

  User: "what's the disk usage?"
  LLM:  [thinking...] EXEC(df -h /)
  Runner: executes → feeds output back
  LLM:  "Root is 67% full, 148G free on your 475GB NVMe."

Supported patterns: web search (ddgr), file reads (cat, head), system diagnostics (journalctl, systemctl, df, free), data queries (jq on JSON files). Up to 3 commands per turn.

5.5 Image generation via chat

When the LLM detects an image request, it emits EXEC(/opt/stable-diffusion.cpp/generate-and-send "prompt"). queue-runner intercepts this pattern and handles it synchronously:

  1. Stop Ollama (free GPU VRAM)
  2. Run sd-cli with FLUX.2-klein-9B (4 steps, 512×512, ~105s)
  3. Send image as Signal attachment
  4. Restart Ollama

Bot is offline during generation (~3 minutes total including ESRGAN upscale and model reload).

Image editing (Kontext): Send a photo to Signal with an edit instruction ("make it cyberpunk", "add a hat"). The LLM emits EXEC(/opt/stable-diffusion.cpp/edit-image "instruction"), queue-runner runs FLUX.1-Kontext-dev with the photo as reference, and sends back the edited image (~5–10 min @512²). Input images are automatically resized to 512×512. See §6.2 for a demo (Sonic → Shadow the Hedgehog).

Video generation: Ask for a video/animation. Uses WAN 2.1 T2V 1.3B (~38 min for 17 frames @480×320).

ESRGAN upscale: Every generated image is automatically upscaled 4× with RealESRGAN_x4plus (512²→2048² in ~25s). Both versions sent via Signal — thumbnail + full-res. You can also send any photo to chat for a standalone 4× upscale.

⚠️ GFX1013 bug: sd-cli hangs after writing the output image (Vulkan cleanup). queue-runner polls for the file and kills the process.

5.6 Personality — "Clawd"

The system prompt defines a cynical, darkly funny personality ("House MD meets a sysadmin lobster"). Key traits:

  • Direct, no corporate speak, no disclaimers
  • Dark humor about the hardware constraints
  • Full access to /opt/netscan/data/ for grounded answers
  • Knows AK's professional context (embedded Linux, camera drivers, V4L2/libcamera)
  • Uncensored creative generation (abliterated model)

The personality is baked into queue-runner.py's SYSTEM_PROMPT — no external workspace files needed.

5.7 Response times

Scenario Latency
Text reply (warm) 10–30s
Complex reasoning with tool use 1–5 min
Image generation (FLUX.2-klein-9B 512²) ~105s
Image generation + auto-upscale 4× ~130s
Image editing (Kontext 512²) ~5–10 min
Video generation (WAN 2.1 480×320) ~38 min
ESRGAN 4× upscale (on-demand) ~25s
Cold start (model reload) 30–60s
Voice note transcription (≤40s) 3–5s
Vision analysis (photo → description) ~40–80s

5.8 Vision analysis

Send a photo to Signal without an edit keyword (no "draw", "generate", "create") and the bot analyzes it using qwen3.5:9b's native multimodal vision. The 9B model processes base64-encoded images via Ollama's /api/chat endpoint.

  User: [photo of a circuit board] "what chip is this?"
  Router: image + non-edit text → vision analysis (9B)
  9B:    "That's an STM32F407 — the LQFP-100 package, 168 MHz Cortex-M4."

How edit vs. analysis is decided:

Input Keywords detected Action
Photo + "make it cyberpunk" ✓ edit → Kontext image editing (§5.5)
Photo + "what is this?" → qwen3.5:9b vision analysis
Photo (no text) → qwen3.5:9b vision analysis

Example — real vision output from the Signal chatbot:

Shadow & Marshall on a floppy disk

This photo was sent to the bot with no text. The qwen3.5:9b model produced the following description (lightly edited for formatting):

This is a charming and nostalgic photo featuring two small figurines placed on a blue 3.5-inch floppy disk, which is resting on a gray outdoor table.

Figurines:

  • On the left: a black hedgehog with red stripes on his head and yellow muzzle — Shadow the Hedgehog from the Sonic the Hedgehog series, standing on a small black circular base.
  • On the right: a white Dalmatian puppy wearing a red firefighter helmet and a yellow collar with a red heart tag — Marshall from PAW Patrol, sitting upright.

Floppy Disk: A classic 3.5-inch disk labeled "2HD 1.44 MB" and "INDEX" (upside down in the image). The label area has horizontal lines like lined paper, adding to the retro aesthetic.

Background: A blurred garden with green grass, bushes, and string lights with clear glass bulbs hanging above.

Overall Vibe: The combination of modern pop culture characters (Shadow and Marshall) with retro tech (floppy disk) creates a fun, geeky, and slightly whimsical display. It's a great blend of nostalgia and fandom!

This is raw model output from a 9.7B parameter model running on the BC-250's Vulkan GPU — no cloud APIs, no preprocessing.

Second example — vintage laptop in a garden:

Laptop in garden

Same model, same hardware, different photo — sent with "Describe this photo in detail":

This is a carefully composed, sunlit outdoor photograph featuring vintage technology and playful whimsy on a balcony or patio.

Main Subject — The Vintage Laptop: Centered in the frame is a Sharp PC-4600, a rare and highly collectible 1980s Japanese laptop computer. It's housed in a bulky, rectangular chassis with rounded corners, rendered in a muted beige or tan plastic — typical of that era's design aesthetic. The screen is black and turned off, framed by a thick bezel. Below the screen, a silver label reads "SHARP PC-4600." The keyboard layout includes standard QWERTY keys plus function keys, number pad, and system controls like POWER, LOW BATTERY, DRIVE A/B.

Secondary Object — Paw Patrol Toy Car: To the right of the laptop rests a colorful toy car from the children's show Paw Patrol. It's Rocky, the green recycling pup, seated inside his signature garbage truck. The toy features bright green and orange plastic, with gray accents and small wheels. Rocky wears his characteristic white helmet with a paw logo. This juxtaposition creates an amusing contrast between high-tech nostalgia and childhood playthings.

Setting — Outdoor Balcony/Patio: The scene is set outdoors on a wooden deck or table with slatted planks. Behind the subjects runs a modern metal railing with vertical bars and a horizontal handrail strung with clear, crystal-like bulb lights — some glowing softly, others unlit. Beyond the railing lies a lush garden: green grass, leafy bushes, and tall evergreen trees under a pale blue sky.

Mood & Atmosphere: The image evokes a sense of quiet nostalgia mixed with lighthearted fun. The vintage computer speaks to the dawn of personal computing — clunky but pioneering — while the Paw Patrol toy injects innocence and humor. The natural backdrop adds serenity and warmth, making it feel like a relaxed afternoon spent admiring curiosities.

Key detail: qwen3.5:9b requires "think": false in the API call. With thinking enabled, the model produces only hidden thinking tokens and returns an empty visible response. Discovered via 7 iterative tests (tests 1–6 all returned empty content).

The MoE model (qwen3.5-35b-a3b-iq2m) did not handle images through the local Ollama/GGUF deployment path — image requests returned HTTP 500 in this configuration. Although upstream Qwen3.5-35B-A3B is described as a multimodal model (HuggingFace model card, Ollama library), the local Ollama/GGUF deployment path did not expose working vision capability. Based on this, model routing delegates all image tasks to qwen3.5:9b.

5.9 Audio transcription

Send a voice note to Signal and the bot transcribes it using whisper.cpp with Vulkan GPU acceleration:

  User: [voice note, 15 seconds, Polish]
  Router: audio/* → whisper-cli (auto language detection)
  Whisper: "Hej, sprawdź mi pogodę na jutro" (pl, 15.2s audio)
  Router: → feed transcription to LLM for response
  LLM:   "Jutro 18°C, częściowe zachmurzenie..."

Whisper setup on BC-250:

Component Value
Runtime whisper.cpp (Vulkan, built from source)
Model ggml-large-v3-turbo (1.6 GB)
Binary /opt/whisper.cpp/build/bin/whisper-cli
Threads 6 (all Zen 2 cores)
Language Auto-detect (EN/PL confirmed)

Why large-v3-turbo, not large-v3?

Both models were benchmarked with real English TTS speech (flite) at three durations. The speed difference is modest (~2×), but memory is the dealbreaker — the larger model doesn't fit alongside Ollama in 16 GB.

Speed comparison:

Whisper Wall Time

Audio large-v3-turbo large-v3 Speedup
3.6s 3.3s 7.9s 2.4×
18.2s 3.5s 8.9s 2.6×
39.2s 4.3s 8.1s 1.9×

The memory problem:

The BC-250 has 16 GB total (UMA — shared between CPU and GPU). The qwen3.5-35b-a3b-iq2m (Qwen MoE) takes 10.6 GB. OS and buffers need ~3.5 GB. That leaves the memory budget looking like this:

Whisper Memory Budget

Scenario Ollama Whisper OS/buffers Total Fits 16 GB?
Ollama only 10.6 GB 3.5 GB 14.1 GB ✅ 1.9 GB free
+ large-v3-turbo 10.6 GB 1.6 GB 3.5 GB 15.7 GB ✅ 0.3 GB free
+ large-v3 10.6 GB 2.9 GB 3.5 GB 17.0 GB ❌ 1.0 GB overflow → swap

When the total exceeds 16 GB, the kernel pushes pages to NVMe swap. This shows up as a measurable swap delta:

Whisper Memory Impact

large-v3 pushes ~1 GB into swap on first load. large-v3-turbo caused no measurable swap increase in testing. Once pages are evicted, subsequent large-v3 runs may show 0 swap delta (the 39s test) because those pages were already swapped out by earlier runs — but the damage (swap pressure, latency spikes) already happened.

Quality is comparable. Both models tested on a 39s embedded-systems passage (flite TTS). Both made the same synthesis artifacts ("kilobots" for "kilobytes", "Wipcomer" for "libcamera"). Neither is clearly better on robotic TTS.

Verdict: large-v3-turbo — 2× faster, 45% smaller, no observable swap pressure in testing on this setup. The quality difference was not distinguishable in this testing.

5.10 Smart model routing

queue-runner automatically selects the appropriate model for each message based on content:

def choose_chat_model(user_text, has_image=False):
    if has_image:
        return "qwen3.5:9b", 65536      # vision + full 64K context
    if estimate_tokens(user_text) > ROUTING_GEMMA4_LIMIT:
        return "qwen3.5-35b-a3b-iq2m", 65536  # Qwen MoE fallback for >40K
    return "gemma4-26b-q3", 49152              # Gemma MoE primary, 48K ctx

Model Routing Speed

Route Model Speed When
Default gemma4-26b-q3 39.0 tok/s Normal chat (≤40K tokens, most messages)
Fallback qwen3.5-35b-a3b-iq2m 37.5 tok/s Prompt > 40K tokens
Vision qwen3.5:9b 31.7 tok/s Photo attached (no edit keywords)

The gemma4-26b-q3 (Gemma MoE, 128 experts, 8+1 active, ~3.8B active/token) is the primary chat model — fastest at 39.0 tok/s, 100% GPU offload, 100% quality score. The qwen3.5-35b-a3b-iq2m (Qwen MoE, 35B total, 3B active/token, 37.5 tok/s) serves as fallback when prompts exceed 40K tokens, where gemma4's 48K ceiling leaves insufficient headroom. Both are MoE architectures, but with very different expert configurations. The 9B is reserved for vision — a capability neither MoE exposes in local Ollama runtime.


6. Image Generation

Stable Diffusion via stable-diffusion.cpp with native Vulkan backend.

▸ Build from source
sudo dnf install -y vulkan-headers vulkan-loader-devel glslc git cmake gcc g++ make
cd /opt && sudo git clone --recursive https://github.com/leejet/stable-diffusion.cpp.git
sudo chown -R $(whoami) /opt/stable-diffusion.cpp && cd stable-diffusion.cpp
mkdir -p build && cd build && cmake .. -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

6.1 Models

FLUX.2-klein-9B — recommended, best visual quality observed in side-by-side testing, Apache 2.0:

mkdir -p /opt/stable-diffusion.cpp/models/flux2 && cd /opt/stable-diffusion.cpp/models/flux2
# Diffusion model (9B, Q4_0, 5.3 GB)
curl -L -O "https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF/resolve/main/flux-2-klein-9b-Q4_0.gguf"
# Qwen3-8B text encoder (Q4_K_M, 4.7 GB)
curl -L -o qwen3-8b-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
# FLUX.2 VAE (321 MB) — different from FLUX.1 VAE!
curl -L -o flux2-vae.safetensors "https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors"

Memory: 5.3 GB VRAM (diffusion) + 6.2 GB VRAM (Qwen3-8B encoder) + 95 MB (VAE) = ~11.8 GB total. Uses 11.8 of the 16.5 GB Vulkan pool.

FLUX.2-klein-4B — fast alternative, Apache 2.0:

cd /opt/stable-diffusion.cpp/models/flux2
# Diffusion model (4B, Q4_0, 2.3 GB)
curl -L -O "https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF/resolve/main/flux-2-klein-4b-Q4_0.gguf"
# Qwen3-4B text encoder (Q4_K_M, 2.4 GB)
curl -L -o qwen3-4b-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"
# Reuses same flux2-vae.safetensors from above

Memory: 2.3 GB VRAM (diffusion) + 3.6 GB VRAM (Qwen3-4B encoder) + 95 MB (VAE) = ~6 GB total. 7× faster than 9B but lower quality. Good for quick previews.

FLUX.1-schnell — previous default, Apache 2.0:

mkdir -p /opt/stable-diffusion.cpp/models/flux && cd /opt/stable-diffusion.cpp/models/flux
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/flux1-schnell-q4_k.gguf"
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/ae.safetensors"
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/clip_l.safetensors"
curl -L -O "https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/resolve/main/t5-v1_1-xxl-encoder-Q4_K_M.gguf"

Memory: 6.5 GB VRAM (diffusion) + 2.9 GB RAM (T5-XXL Q4_K_M) = ~10 GB total.

Chroma flash Q4_0 — alternative, open-source:

Download from Chroma-GGUF repo — exact filenames may change between versions. Reuses existing T5-XXL and FLUX.1 ae.safetensors from above.

Memory: 5.1 GB VRAM (diffusion) + 3.2 GB RAM (T5-XXL) = ~8.4 GB total.

SD-Turbo — fast fallback, lower quality:

cd /opt/stable-diffusion.cpp/models
curl -L -o sd-turbo.safetensors \
  "https://huggingface.co/stabilityai/sd-turbo/resolve/main/sd_turbo.safetensors"

6.2 Performance

sd.cpp master-504-636d3cb, Vulkan GFX1013 (16.5 GiB), Ollama stopped.

Important: FLUX GGUF files must use --diffusion-model flag, not -m. The -m flag fails with "get sd version from file failed" because GGUF metadata is empty after tensor name conversion. FLUX.2-klein models must use --llm (not --t5xxl) for the Qwen3 encoder — the tensor naming differs between LLM and T5 architectures.

🏆 FLUX.2-klein-9B Q4_0 — default (best visual quality observed in side-by-side testing):

Resolution Steps Time s/step Notes
512×512 4 67s 16.8 Default, ~11.8 GB VRAM total
768×768 4 97s 24.2 VAE tiling
1024×1024 4 147s 36.8 VAE tiling
512×512 8 ❌ FAIL OOM at higher step count

FLUX.2-klein-9B uses a Qwen3-8B LLM as text encoder — in this testing, it showed better prompt following and finer detail than the 4B variant. Uses 11.8 GB of the 16.5 GB Vulkan pool. The --offload-to-cpu and --llm flags are required.

FLUX.2-klein-4B Q4_0 — fast alternative:

Resolution Steps Time s/step Notes
512×512 4 37s 9.2 Fast preview, ~6 GB VRAM total
768×768 4 52s 13.0 VAE tiling
1024×1024 4 82s 20.5 VAE tiling
512×512 8 42s 5.2 GPU warm, more quality
1024×1024 8 122s 15.2 VAE tiling

2× faster than 9B. Good for quick previews or batch generation. 1024² works reliably at both 4 and 8 steps.

FLUX.1-schnell Q4_K — previous default:

Resolution Steps Time Notes
512×512 4 107s ~10 GB VRAM (6.5 diffusion + 3.4 encoders)
768×768 4 92s VAE tiling
1024×1024 4 148s VAE tiling, good quality

FLUX.1-kontext-dev Q4_0 — image editing:

Resolution Steps Time Notes
512×512 20 132s Uses -r flag for reference image, CLIP+T5
768×768 20 282s VAE tiling

Kontext is a dedicated image editing model. Takes a reference image via -r and a text instruction to produce an edited version.

Chroma flash Q4_0 — quality alternative (reuses T5+VAE from FLUX.1):

Resolution Steps Time Notes
512×512 4 67s T5-XXL encoder
512×512 8 97s Better quality
768×768 8 158s VAE tiling

FLUX.1-dev Q4_K_S — high-quality, slow (city96/FLUX.1-dev-gguf, 6.8 GB):

Resolution Steps Time Notes
512×512 20 167s ~6.6 GB VRAM
768×768 20 ❌ FAIL Guidance model compute graph exceeds VRAM

SD3.5-medium Q4_0:

Resolution Steps Time Notes
512×512 28 102s CLIP-L + CLIP-G + T5-XXL
768×768 28 192s VAE tiling
1024×1024 28 337s VAE tiling

SD-Turbo — fast fallback:

Resolution Steps Time Notes
512×512 1 22s Minimum viable, ~2 GB VRAM
512×512 4 27s
768×768 4 32s Decent for quick previews
1024×1024 4 62s VAE tiling — newly tested, works

Head-to-head comparison (512×512, same prompt, seed 42):

Model Time @512² Steps VRAM Encoder
SD-Turbo 27s 4 2 GB built-in
FLUX.2-klein-4B 37s 4 6 GB Qwen3-4B (--llm)
FLUX.2-klein-9B 67s 4 11.8 GB Qwen3-8B (--llm)
Chroma flash 67s 4 8.4 GB T5-XXL
SD3.5-medium 102s 28 6 GB CLIP+T5
FLUX.1-schnell 107s 4 10 GB CLIP+T5
FLUX.1-kontext-dev 132s 20 10 GB CLIP+T5 (+ ref image)
FLUX.1-dev 167s 20 10 GB CLIP+T5

FLUX.2-klein-9B replaces schnell as the preferred default: faster (67s vs 107s @512²) and subjectively better in prompt following and fine detail during side-by-side tests. klein-4B is the speed champion (37s) when quality can be traded.

Summary: recommended settings for production

Use case Model Resolution Steps Time
Default (Signal) FLUX.2-klein-9B 512×512 4 ~67s
High quality FLUX.2-klein-9B 768×768 4 ~97s
Quick preview FLUX.2-klein-4B 512×512 4 ~37s
Poster/wallpaper FLUX.2-klein-4B 1024×1024 8 ~122s
Highest quality (slow) Chroma flash 512×512 8 ~97s
# FLUX.2-klein-9B — recommended production command:
/opt/stable-diffusion.cpp/build/bin/sd-cli \
  --diffusion-model models/flux2/flux-2-klein-9b-Q4_0.gguf \
  --vae models/flux2/flux2-vae.safetensors \
  --llm models/flux2/qwen3-8b-Q4_K_M.gguf \
  -p "your prompt here" \
  --cfg-scale 1.0 --steps 4 -H 512 -W 512 \
  --offload-to-cpu --diffusion-fa -v \
  -o output.png

6.2.1 Upgrade roadmap — beyond the current stack

sd.cpp (master-504+) supports more models. The BC-250 has ~16.5 GB with Ollama stopped (post-GTT migration). All models use --offload-to-cpu (UMA — no PCIe penalty).

Image generation — tested models:

Model Params GGUF Size Total RAM¹ Steps Quality Status
FLUX.2-klein-9B Q4_0 9B 5.3 GB ~11.8 GB 4 ★★★★ ✅ Current default, 67s @512²
FLUX.2-klein-4B Q4_0 4B 2.3 GB ~6 GB 4 ★★★ ✅ Fast alternative, 37s @512²
FLUX.1-schnell Q4_K 12B 6.5 GB ~10 GB 4 ★★ ✅ Previous default, 107s @512²
Chroma flash Q4_0 12B 5.1 GB ~8.4 GB 4–8 ★★★ ✅ Tested — 67s @512², good quality
FLUX.1-dev Q4_K_S 12B 6.8 GB ~10 GB 20 ★★★★ ✅ Tested — 167s @512², ❌768²+
SD-Turbo 1.1B ~2 GB ~2.5 GB 1–4 ✅ Fast preview, 22s @512²
SD3.5-medium Q4_0 2.5B 1.7 GB ~6 GB 28 ★★★ ✅ Tested — 102s @512², scales to 1024² (337s)

¹ Total RAM includes diffusion model + text encoder(s) + VAE.

³ BF16 VAE gotcha — see SD3.5 section below.

Video generation — tested models:

Model Params GGUF Size Total RAM¹ Frames Time Status
WAN 2.1 T2V 1.3B Q4_0 1.3B 826 MB ~5 GB 17 @480×320 ~38 min ✅ Works on BC-250

WAN requires umt5-xxl text encoder (3.5 GB Q4_K_M) + WAN VAE (243 MB). Outputs raw AVI (MJPEG). No matrix cores = slow but works.

Video generation — tested (OOM):

Model Params GGUF Size Total RAM¹ Notes
WAN 2.2 TI2V 5B Q4_0 5B 2.9 GB ~9 GB ❌ OOM crash at Q4_0. Model (2.9G) + VAE (1.4G) + T5 (4.7G) = 9 GB — exceeds UMA budget during video denoising. May work with Q2_K model + Q2_K T5 (~6 GB) but untested.

Image editing — FLUX.1-Kontext-dev:

Model Params GGUF Size Total RAM¹ Status
FLUX.1-Kontext-dev Q4_0 12B 6.8 GB ~10 GB ✅ Tested — 132s @512² (20 steps), 282s @768². Uses -r flag, reuses FLUX.1 T5/CLIP/VAE

Kontext is a dedicated image editing model by Black Forest Labs. It takes a reference image via -r and a text instruction to produce an edited version. Uses existing FLUX.1 encoders (T5-XXL, CLIP_L) and VAE (ae.safetensors) from /opt/stable-diffusion.cpp/models/flux/.

# Edit an existing image with Kontext:
sd-cli --diffusion-model models/flux/flux1-kontext-dev-Q4_0.gguf \
  --vae models/flux/ae.safetensors --clip_l models/flux/clip_l.safetensors \
  --t5xxl models/flux/t5-v1_1-xxl-encoder-Q4_K_M.gguf --clip-on-cpu \
  -r input.png -p "change the sky to sunset" --cfg-scale 3.5 --steps 28 \
  --sampling-method euler --offload-to-cpu --diffusion-fa -o output.png

Kontext demo — "turn Sonic into Shadow the Hedgehog":

Input (1200×1600 → resized to 512×512) Output (512×512, 647s) Output + ESRGAN 4× (2048×2048, +25s)
Kontext input Kontext output Kontext 4×

The 4× upscaled version (right) is generated automatically by the ESRGAN auto-upscale pipeline — every generated/edited image gets a 2048×2048 version sent alongside the 512×512 original. Total overhead: ~25s with tile 192. See ESRGAN benchmarks below.

SD3.5-medium benchmark details

Timing breakdown (512×512, 28 steps, seed 42):

Phase Time Notes
CLIP + T5 encoding ~4s clip_l + clip_g + t5-v1_1-xxl Q4_K_M
Diffusion sampling ~95s 28 steps × ~3.4s/it (mmdit 2.1 GB on Vulkan)
VAE decode ~3s F16-converted VAE (94.6 MB)
Total 102s

Resolution scaling:

Resolution Steps Time s/step
512×512 28 102s 3.6
768×768 28 192s 6.9
1024×1024 28 337s 12.0

Model stack on disk:

Component File Size
Diffusion sd3.5_medium-q4_0.gguf 1.7 GB
CLIP-L clip_l.safetensors (shared with FLUX) 246 MB
CLIP-G clip_g.safetensors 1.3 GB
T5-XXL t5-v1_1-xxl-encoder-Q4_K_M.gguf (shared with FLUX) 2.9 GB
VAE sd3_vae_f16.safetensors (converted from BF16) 160 MB
Total on disk ~6.3 GB
# SD3.5-medium generation command:
sd-cli --diffusion-model models/sd3/sd3.5_medium-q4_0.gguf \
  --vae models/sd3/sd3_vae_f16.safetensors \
  --clip_l models/flux/clip_l.safetensors \
  --clip_g models/sd3/clip_g.safetensors \
  --t5xxl models/flux/t5-v1_1-xxl-encoder-Q4_K_M.gguf \
  -p "prompt" --cfg-scale 4.5 --sampling-method euler --steps 28 \
  -W 512 -H 512 --diffusion-fa --offload-to-cpu -o output.png

⚠ BF16 VAE gotcha: The upstream SD3 VAE (diffusion_pytorch_model.safetensors) uses BF16 tensors. In this setup (RADV Mesa 25.3.4), GFX1013 Vulkan did not handle BF16 tensors — the output was a solid blue/yellow rectangle. Fix: convert to F16 with python3 convert_vae_bf16_to_f16.py input.safetensors output.safetensors (script in /tmp/).

WAN 2.1 T2V 1.3B benchmark details

Timing breakdown (480×320, 17 frames, 50 steps, seed 42):

Phase Time Notes
umt5-xxl encoding ~4s 3.5 GB Q4_K_M text encoder
Diffusion sampling ~35 min 17 frames × 50 steps. No matrix cores → pure scalar Vulkan
VAE decode ~30s WAN VAE (243 MB), decodes all 17 frames
Total ~38 min

Model stack on disk:

Component File Size
Diffusion Wan2.1-T2V-1.3B-Q4_0.gguf 826 MB
Text encoder umt5-xxl-encoder-Q4_K_M.gguf 3.5 GB
VAE wan_2.1_vae.safetensors 243 MB
Total on disk ~4.5 GB
# WAN 2.1 text-to-video generation:
sd-cli -M vid_gen \
  --diffusion-model models/wan/Wan2.1-T2V-1.3B-Q4_0.gguf \
  --vae models/wan/wan_2.1_vae.safetensors \
  --t5xxl models/wan/umt5-xxl-encoder-Q4_K_M.gguf \
  -p "A cat walking across a sunny garden" \
  --cfg-scale 6.0 --sampling-method euler \
  -W 480 -H 320 --diffusion-fa --offload-to-cpu \
  --video-frames 17 --flow-shift 3.0 -o output.mp4

Output format: sd.cpp produces raw AVI (MJPEG) regardless of the -o extension. The 17-frame clip plays at 16 fps (~1 second). Quality is recognizable but noisy — expected at Q4_0 with scalar-only Vulkan compute.

Why so slow? Each video frame is a full diffusion pass through the 1.3B model. With 17 frames × 50 steps × no matrix cores, every multiply is scalar. A GPU with tensor/matrix units (RDNA3+, Turing+) would likely be substantially faster.

WAN 2.1 demo — "A cat walking across a sunny garden":

WAN 2.1 T2V — cat in garden

17 frames @480×320, 50 steps, Q4_0 quantization, EUR scheduler, cfg-scale 6.0. Generated in ~38 minutes on GFX1013 scalar Vulkan — no matrix/tensor cores. Noisy but recognizable — generated by a 1.3B parameter model on a secondhand BC-250.

ESRGAN 4× upscale benchmarks

All generated images are automatically upscaled with RealESRGAN_x4plus (64 MB model, 4× scaling). Runs immediately after generation while Ollama is still stopped — no additional GPU memory contention.

ESRGAN tile size benchmark (512² input → 2048² output):

Tile Size Time Output Notes
128 (default) 22s 2048×2048 Fastest
192 (production) 22s 2048×2048 Best observed quality/speed
256 22s 2048×2048 No visible difference at this input size
128 ×2 passes (16×!) 4m 50s 8192×8192, 67 MB 512²→8192² in under 5 min

Production uses tile 192: larger tiles mean fewer seam boundaries → cleaner upscale. The 16× mode (two ESRGAN passes) produces 67-megapixel images from 512² input — available on-demand via EXEC(upscale ...) but not automatic (too large for Signal).

ESRGAN upscale benchmark

Chart note: The ESRGAN chart above was generated from an earlier benchmark run. Current tile-size timings are in the table above; the chart's per-tile times are stale.

Image/video pipeline timing

Chart note: The three charts below were generated against sd.cpp master-525. Production was reverted to master-504 due to a FLUX.2-klein tensor naming regression (see §12). The tables in §6.2 and B9 reflect master-504 timings and are authoritative; the charts are preserved for relative comparison only.

End-to-end timing (sd.cpp master-525, not current production):

SD pipeline timing

Phase breakdown — where the time goes in each pipeline:

SD pipeline breakdown

FLUX.1-schnell resolution scaling — time vs pixel count (FLUX.1-schnell only; does not include FLUX.2-klein, the current production default):

FLUX resolution scaling


PART III — Monitoring & Intelligence

7. Netscan Ecosystem

A research, monitoring, and data collection system with 330 autonomous jobs running on a GPU-constrained single-board computer. Dashboard at http://<LAN_IP>:8888 — 29 main pages + 101 per-host detail pages.

7.1 Architecture — queue-runner v7

The BC-250 has 16 GB GTT shared with the CPU — only one LLM job can run at a time. queue-runner.py (systemd service) orchestrates all 330 jobs in a continuous loop, with Signal chat between every job:

queue-runner v7 -- Continuous Loop + Signal Chat

Cycle N:
  330 jobs sequential, ordered by category:
  scrape -> infra -> lore -> academic -> repo -> company -> career
         -> think -> csi -> meta -> market -> report
  HA observations interleaved every 50 jobs
  Signal inbox checked between EVERY job
  Chat processed with LLM (EXEC tool use + image gen)
  Crash recovery: resumes from last completed job

Cycle N+1:
  Immediately starts -- no pause, no idle windows
  No nightly/daytime distinction

Key design decisions (v5 → v7):

v5 (OpenClaw era) v7 (current)
Nightly batch + daytime fill Continuous loop, no distinction
354 jobs (including duplicates) 330 jobs (deduped, expanded)
LLM jobs routed through openclaw cron run All jobs run as direct subprocesses
Signal via OpenClaw gateway (~700 MB) signal-cli standalone (~100 MB)
Chat only when gateway available Chat between every job
Async SD pipeline (worker scripts, 45s delay) Synchronous SD (stop Ollama → generate → restart)
GPU idle detection for user chat preemption No preemption needed — chat is interleaved

All jobs run as direct subprocessessubprocess.Popen for Python/bash scripts, no LLM agent routing. In testing, this was roughly 3–10× faster than the old openclaw cron run path, eliminating the gateway dependency entirely.

7.1.1 Queue ordering

The queue prioritizes data diversity — all dashboard tabs get fresh data even if the cycle is interrupted. See §7.3 for the full category breakdown with GPU times. HA observations are interleaved every 50 jobs, and Signal chat is checked between every job.

7.1.2 GPU idle detection

GPU idle detection is used for legacy --daytime mode and Ollama health checks:

# Three-tier detection:
# 1. Ollama /api/ps → no models loaded → definitely idle
# 2. sysfs pp_dpm_sclk → clock < 1200 MHz → model loaded but not computing
# 3. Ollama expires_at → model about to unload → idle for 3+ min

In continuous loop mode (default), GPU detection is only used for pre-flight health checks — not for yielding to user chat, since chat is interleaved between jobs.

7.2 Scripts

GPU jobs (queue-runner — sequential, one at a time):

Script Purpose Jobs
career-scan.py Two-phase career scanner (§8) 1
career-think.py Per-company career deep analysis 65
salary-tracker.py Salary intel — NoFluffJobs, career-scan extraction 1
company-intel.py Deep company intel — GoWork, DDG news, layoffs (43 entities) 1
company-think-* Focused company deep-dives 106
patent-watch.py IR/RGB camera patent monitor — Google Patents, EPO OPS, DuckDuckGo 1
event-scout.py Meetup/conference tracker — Poland, Europe 1
leak-monitor.py CTI: 11 OSINT sources — HIBP, Hudson Rock, GitHub dorks, Ahmia dark web, CISA KEV, ransomware, Telegram 1
idle-think.sh Research brain — 8 task types → JSON notes 34
ha-journal.py Home Assistant analysis (climate, sensors, anomalies) 2
ha-correlate.py HA cross-sensor correlation 2
city-watch.py SkyscraperCity local construction tracker 1
csi-sensor-watch.py CSI camera sensor patent/news monitor 1
csi-think.py CSI camera domain analysis (drivers, ISP, GMSL) 6
lore-digest.sh Kernel mailing list digests (8 feeds) 8
repo-watch.sh Upstream repos (GStreamer, libcamera, v4l-utils, FFmpeg, LinuxTV) 8
repo-think.py LLM analysis of repo changes 26
market-think.py Market sector analysis + synthesis 19
life-think.py Cross-domain life advisor 2
system-think.py GPU/security/health system intelligence 3
radio-scan.py Radio hobbyist forum tracker 1
career-digest.py Weekly career digest → Signal (Sunday) 1
daily-summary.py End-of-cycle summary → dashboard + Signal 2
academic-watch.py Academic publication monitor (4 topics × 3 types) 12
book-watch.py Book/publication tracker (11 subjects) 11
news-watch.py Tech news aggregation + RSS 2
weather-watch.py Weather forecast + HA sensor correlation 2
car-tracker.py GPS car tracker (SinoTrack API) 1
frost-guard.py Frost/freeze risk alerter 1

CPU jobs (system crontab — independent of queue-runner):

Script Frequency Purpose
gpu-monitor.sh + .py 1 min GPU utilization sampling (3-state)
presence.sh 5 min Phone presence tracker
syslog.sh 5 min System health logger
watchdog.py 30 min (live), 06:00 (full) Network security — ARP, DNS, TLS, vulnerability scoring
scan.sh + enumerate.sh 04:00 Network scan + enumeration (nmap)
vulnscan.sh Weekly (Sun) Vulnerability scan
repo-watch.sh 08:00, 14:00, 18:00 Upstream repo data collection
report.sh 08:30 Morning report rebuild
generate-html.py After each queue-runner job Dashboard HTML builder (6900+ lines)
gpu-monitor.py chart 22:55 Daily GPU utilization chart

7.3 Job scheduling — queue-runner v7

Job categories (auto-classified by name pattern):

Category Jobs Typical GPU time Examples
scrape 29 0.1h career-scan, salary, patents, book-watch, repo-scan (no LLM)
infra 6 0.6h leak-monitor, netscan, watchdog, frost-guard, radio-scan
lore 8 0.5h lore-digest per mailing list feed
academic 12 academic-watch per topic × type
repo 27 0.3h LLM analysis of repo changes + weekly digest
company 107 0.9h company-intel + competitive/financial/strategy deep-dives
career 66 1.9h career-think per company + weekly digest
think 34 2.0h research, trends, crawl, crossfeed
csi 6 0.3h CSI camera domain analysis
meta 5 life-think, system-think
market 19 0.9h market-think per asset + synthesis
ha 4 1.0h ha-correlate, ha-journal (interleaved)
report 4 daily-summary, news + weather analysis
weekly 3 vulnscan, csi-sensor-discover/improve
Total 330 ~9h

Data flow:

jobs.json (330 jobs)
  |
  v
queue-runner.py
  |
  |-- All jobs -> subprocess.Popen -> python3/bash /opt/netscan/...
  |                                         |
  |       JSON results <--------------------+
  |         |
  |         |-- /opt/netscan/data/{category}/*.json
  |         |
  |         +-- generate-html.py -> /opt/netscan/web/*.html -> nginx :8888
  |
  |-- Signal chat (between every job)
  |     via JSON-RPC http://127.0.0.1:8080/api/v1/rpc
  |
  +-- Signal alerts (career matches, leaks, events, daily summary)

7.4 Data flow & locations

All paths relative to /opt/netscan/:

Data Path Source
Research notes data/think/note-*.json + notes-index.json idle-think.sh
Career scans data/career/scan-*.json + latest-scan.json career-scan.py
Career analysis data/career/think-*.json career-think.py
Salary data/salary/salary-*.json (180-day history) salary-tracker.py
Company intel data/intel/intel-*.json + company-intel-deep.json company-intel.py
Patents data/patents/patents-*.json + patent-db.json patent-watch.py
Events data/events/events-*.json + event-db.json event-scout.py
Leaks / CTI data/leaks/leak-intel.json leak-monitor.py
City watch data/city/city-watch-*.json city-watch.py
CSI sensors data/csi-sensors/csi-sensor-*.json csi-sensor-watch.py
HA correlations data/correlate/correlate-*.json ha-correlate.py
HA journal data/ha-journal-*.json ha-journal.py
Mailing lists data/{lkml,soc,jetson,libcamera,dri,usb,riscv,dt}/ lore-digest.sh
Repos data/repos/ repo-watch.sh, repo-think.py
Market data/market/ market-think.py
Academic data/academic/ academic-watch (LLM)
GPU load data/gpu-load.tsv gpu-monitor.sh
System health data/syslog/health-*.tsv (30-day retention) syslog.sh
Network hosts data/hosts-db.json scan.sh
Presence data/presence-state.json presence.sh
Radio data/radio/ radio-scan.py
Queue state data/queue-runner-state.json queue-runner.py

7.5 Dashboard — 29 main pages + 101 host detail pages

Served by nginx at :8888, generated by generate-html.py (6900+ lines):

Page Content Data source
index.html Overview — hosts, presence, latest notes, status aggregated
home.html Home Assistant — climate, energy, anomalies ha-journal, ha-correlate
career.html Career intelligence — matches, trends career-scan, career-think
market.html Market analysis — sectors, commodities, crypto market-think
advisor.html Life advisor — cross-domain synthesis life-think
notes.html Research brain — all think notes idle-think
leaks.html CTI / leak monitor leak-monitor
issues.html Upstream issue tracking repo-think
events.html Events calendar — Poland, Europe event-scout
lkml.html Linux Media mailing list digest lore-digest (linux-media)
soc.html SoC bringup mailing list lore-digest (soc-bringup)
jetson.html Jetson/Tegra mailing list lore-digest (jetson-tegra)
libcamera.html libcamera mailing list lore-digest (libcamera)
dri.html DRI-devel mailing list lore-digest (dri-devel)
usb.html Linux USB mailing list lore-digest (linux-usb)
riscv.html Linux RISC-V mailing list lore-digest (linux-riscv)
dt.html Devicetree mailing list lore-digest (devicetree)
academic.html Academic publications academic-watch
hosts.html Network device inventory scan.sh
security.html Host security scoring vulnscan.sh
presence.html Phone detection timeline presence.sh
load.html GPU utilization heatmap + schedule gpu-monitor
radio.html Radio hobbyist activity radio-scan.py
car.html Car tracker car-tracker
weather.html Weather forecast + HA sensor correlation weather-watch.py
news.html Tech news aggregation + RSS news-watch.py
health.html System health assessment (services, data freshness, LLM quality) bc250-extended-health.py
history.html Changelog
log.html Raw scan logs
host/*.html Per-host detail pages (101 hosts) scan.sh, enumerate.sh

Mailing list feeds are configured in digest-feeds.json — 8 feeds from lore.kernel.org, each with relevance scoring keywords.

7.6 GPU monitoring — 3-state

Per-minute sampling via pp_dpm_sclk:

State Clock Temp Meaning
generating 2000 MHz ~77°C Active LLM inference
loaded 1000 MHz ~56°C Model in VRAM, idle
idle 1000 MHz <50°C No model loaded

7.7 Configuration & state files

File Purpose
profile.json Public interests — tracked repos, keywords, technologies
profile-private.json Career context — target companies, salary expectations (gitignored)
watchlist.json Auto-evolving interest tracker
digest-feeds.json Mailing list feed URLs (8 feeds from lore.kernel.org)
repo-feeds.json Repository API endpoints
sensor-watchlist.json CSI camera sensor tracking list
queue-runner-state.json Cycle count, resume index (in data/)
/opt/netscan/data/jobs.json All 330 job definitions

7.8 Resilience

Mechanism Details
Systemd watchdog WatchdogSec=14400 (4h) — queue-runner pings every 30s during job execution
Crash recovery State file records batch progress; on restart, resumes from last completed job
Midnight crossing Resume index valid for both today and yesterday's date (batch starts 23:00 day N, may crash after midnight day N+1)
Atomic state writes Write to .tmp file, fsync(), then rename() — survives SIGABRT/power loss
Ollama health checks Pre-flight check before each job; exponential backoff wait if unhealthy
Network down Detects network loss, waits with backoff up to 10min
GPU deadlock protection If GPU busy for > 60min continuously, breaks and moves on
OOM protection Ollama OOMScoreAdjust=-1000, 16 GB NVMe swap, zram limited or disabled
Signal delivery --best-effort-deliver flag — delivery failures don't mark job as failed

8. Career Intelligence

Automated career opportunity scanner with a two-phase anti-hallucination architecture.

8.1 Two-phase design

  HTML page
    +-> Phase 1: extract jobs (NO candidate profile) -> raw job list
                                                            |
  Candidate Profile + single job ---------------------------+
    +-> Phase 2: score match -> repeat per job
                                   +-> aggregate -> JSON + Signal alerts

Phase 1 extracts jobs from raw HTML without seeing the candidate profile — reducing the risk of the LLM hallucinating matching jobs. Phase 2 scores each job individually against the profile.

8.2 Alert thresholds

Category Score Alert?
⚡ Hot match ≥70% ✅ (up to 5/scan)
🌍 Worth checking 55–69% + remote ✅ (up to 2/scan)
Good / Weak <55% Dashboard only

Software houses (SII, GlobalLogic, Sysgo…) appear on the dashboard but never trigger alerts.

8.3 Salary tracker · salary-tracker.py

Runs once per cycle (scrape category). Sources: career-scan extraction, NoFluffJobs API, JustJoinIT, Bulldogjob. Tracks embedded Linux / camera driver compensation in Poland. 180-day rolling history.

8.4 Company intelligence · company-intel.py

Runs once per cycle (company category). Deep-dives into 43 tracked companies across 8 sources: GoWork.pl reviews, DuckDuckGo news, Layoffs.fyi, company pages, 4programmers.net, Reddit, SemiWiki, Hacker News. LLM-scored sentiment (-5 to +5) with cross-company synthesis.

GoWork.pl: New Next.js SPA breaks scrapers. Scanner uses the old /opinie_czytaj,{entity_id} URLs (still server-rendered).

8.5 Patent watch · patent-watch.py

Runs once per cycle (scrape category). Monitors 6 search queries (MIPI CSI, IR/RGB dual camera, ISP pipeline, automotive ADAS, sensor fusion, V4L2/libcamera) across Google Patents, EPO OPS, and DuckDuckGo. Scored by relevance keywords × watched assignee bonus.

8.6 Event scout · event-scout.py

Runs once per cycle (scrape category). Discovers tech events with geographic scoring (local 10, nearby 8, Poland 5, Europe 3, Online 9). Sources: Crossweb.pl, Konfeo, Meetup, Eventbrite, DDG, 14 known conference sites.


PART IV — Comprehensive Benchmarks

32 LLM models, 5 measurement phases, 8 image generation models. All measurements on a single BC-250 board. Statistically validated (CV <1.5% across 8 models at 4K). Quality scored by Python script (keyword match, JSON parse, regex).

Metric Value
LLM models tested 32
Quality score (median) 100% (benchmark ceiling — even 3B models score 93%)
Models reaching 64K filled context 25 of 32
Fastest model 103.8 tok/s (llama3.2:3b)
Primary chat speed 39.0 tok/s (gemma4-26b-q3, 26B MoE at Q3_K_M)
Statistical reliability CV < 1.5% (8 models, 3 runs each)
Image gen models 8 (27s–167s @ 512²)

B1. Methodology

B1.1 Benchmark suite

Five measurement phases:

Phase Validated scope Prompt Runs Key metric
Perf 32 models @ 4K Standard ~400 tok 1 gen, prefill, TTFT, VRAM, GPU%, layers, swap
Stats 8 models @ 4K Standard ~400 tok 3 Median, min, max, coefficient of variation
Context 30 models with usable data (of 32 attempted — 1 broken, 1 failed to load) 80% fill block 1–2 Gen degradation, prefill scaling, TTFT, swap, truncation detection
Quality All 32 models 5 task types 3 Summarization, JSON, fact recall, instruction, arithmetic
Cold 2 production models Standard ~400 tok 3 Cold-start TTFT (unload → first token)

Platform: Fedora 43, kernel 6.18.9, Mesa 25.3.4 RADV, Vulkan 1.4.328, Ollama 0.18.0. Q4_0 KV cache. All services stopped during measurement. Model unloaded between tests.

Environment controls:

  • Swap: NVMe-backed only (16 GiB file on NVMe). zram was set to disksize=0 (device exists but inactive — see §3.5). Swap usage recorded via /proc/meminfo before and after each test.
  • Software versions: All five phases ran on the identical software stack listed above. No package updates between phases.
  • Page cache: OS page cache was not dropped between runs. After the first model load, GGUF file pages remain in the Linux page cache, so subsequent cold-start loads read from RAM rather than NVMe. This explains the qwen3.5:9b Run 1 → Run 2 gap in B7 (11.9s → 6.8s). Prefill and generation measurements are unaffected because they are GPU-compute-bound, not I/O-bound.
  • KV state: Ollama discards all KV cache when a model is unloaded (ollama stop or OLLAMA_KEEP_ALIVE expiry). Repeated runs start with a cold KV cache. Prefill timings therefore reflect full prompt processing, not cached attention state.

B1.2 How we measure

Image

  • Prompt standardization: All performance tests use a single ~400-token prompt (RISC vs CISC architectures) with num_predict=100
  • Filled context: Context scaling fills 80% of num_ctx with real English text (~500 tok per block) and verifies prompt_eval_count matches expected tokens — catches silent truncation
  • Quality scoring: 5 tasks with deterministic pass/fail checks, executed by a Python scoring script:
    • Summarization — keyword presence + sentence count
    • JSON extraction — valid parse + keys + values
    • Fact recall — target keywords present
    • Instruction following — correct number of items
    • Arithmetic — correct answer for 17 × 23
  • Statistical validation: 3 runs per model, CV calculated. Phase 3 context-scaling pairs confirm low variance (mean 0.55%, max 2.4%) across all context levels

B1.3 What "filled context" means (and why it matters)

The single most important finding: Allocating 128K context (tiny prompt + large num_ctx) always succeeds, but filling 128K with real tokens times out (TTFT >20 min) for every model tested. Prior Ollama benchmarks that report "128K context" without filling it are misleading.

Ollama also silently truncates some models to their native context limit without any error. Verified: qwen2.5:3b → 32K native, phi4:14b → 16K native. The prompt_eval_count field is the only reliable indicator.


B2. Statistical Validation

CV < 1.5% for all models — single-run measurements are reliable on this thermally steady UMA system.

Model Gen median Range CV%
qwen3:14b 26.6 [26.6 – 26.7] 0.2%
mistral-nemo:12b 34.0 [33.9 – 34.0] 0.2%
qwen3:8b 42.8 [42.8 – 43.0] 0.3%
★ qwen3.5:9b 31.7 [31.7 – 31.9] 0.4%
★ MoE 35B-A3B 37.5 [37.3 – 37.6] 0.4%
Qwen3-30B-A3B (Q2_K) 58.5 [57.9 – 58.9] 0.9%
llama3.2:3b 102.2 [101.3 – 103.9] 1.3%
phi4-mini 86.1 [85.0 – 87.4] 1.4%

The largest models show the tightest variance (0.2%). Smaller models show slightly more due to measurement granularity at higher speeds.

Statistical validation — CV and speed ranges


B3. Generation Speed

Standard prompt (~400 tokens), num_predict=100, num_ctx=4096, single run. Sorted by generation speed. Two near-duplicate 14B profiles are omitted from this ranking: qwen3-14b-16k and qwen3-14b-abl-nothink (alternate tag of huihui_ai/qwen3-abliterated:14b).

# Model Params Quant Gen tok/s Prefill tok/s TTFT VRAM Quality
1 llama3.2:3b 3.2B Q4_K_M 103.8 484.8 2.8s 2.2 GiB 93%
2 qwen2.5:3b 3.1B Q4_K_M 102.0 477.9 5.0s 2.1 GiB 73%
3 phi4-mini 3.8B Q4_K_M 87.0 346.3 6.2s 2.5 GiB 93%
4 gemma3:4b 4B Q4_K_M 76.5 357.1 6.5s 3.8 GiB 100%
5 qwen3:4b 4B Q4_K_M 73.6 314.0 4.0s 2.9 GiB 33%⁶
6 Qwen3-Coder-30B-A3B 30.5B/3.3B UD-IQ2_M 62.2 149.0 11.0 GiB 87%
7 Qwen3-30B-A3B (Q2_K) 30.5B/3B Q2_K 59.0 131.5 17.0s 10.7 GiB 27%⁶
9 qwen2.5-coder:7b 7.6B Q4_K_M 54.8 247.3 8.9s 4.4 GiB 40%
10 llama3.1:8b 8.0B Q4_K_M 51.3 196.9 9.9s 4.7 GiB 93%
11 seed-coder-abliterate:8b 8.3B Q4_K_M 50.8 216.7 9.8s 4.8 GiB 87%
12 lexi-8b (uncensored) 8.0B Q4_0 49.9 299.0 10.2s 4.5 GiB 100%
13 granite3.3:8b 8B Q4_K_M 45.8 173.0 8.9s 4.9 GiB 80%
14 qwen3-abl-nothink:8b 8.2B Q4_K_M 45.6 192.7 7.7s 4.9 GiB 100%
15 qwen3-abliterated:8b 8.2B Q4_K_M 45.5 208.8 3.4s 4.9 GiB 100%
16 glm4:9b 9B Q4_K_M 44.9 201.4 11.2s 5.1 GiB 93%
17 deepseek-r1:8b 8B Q4_K_M 43.2 184.8 8.0s 5.1 GiB 73%
18 qwen3:8b 8.2B Q4_K_M 43.1 192.7 7.7s 5.1 GiB 100%
19 qwen3:8b-nothink 8.2B Q4_K_M 43.1 209.7 3.4s 5.1 GiB 100%
20 gemma2:9b 9.2B Q4_0 38.2 194.8 12.5s 6.9 GiB 100%
21 MoE 35B-A3B 35B/3B UD-IQ2_M 37.5 127.5 17.4s 12.3 GiB 93%
22 mistral-nemo:12b 12.2B Q4_0 34.1 159.8 13.0s 6.7 GiB 80%
23 qwen3.5:9b 9.7B Q4_K_M 31.7 171.3 11.2s 7.9 GiB 100%
24 qwen3:8b-q8_0 8.2B Q8_0 31.2 237.2 12.6s 8.5 GiB 100%
25 gemma3:12b 12B Q4_K_M 29.1 135.1 12.9s 8.7 GiB 100%
26 deepseek-r1:14b 14B Q4_K_M 28.7 101.4 16.0s 8.5 GiB 100%
27 phi4:14b 14.7B Q4_K_M 28.6 108.2 15.8s 8.5 GiB 100%
28 qwen3-abliterated:14b 14.8B Q4_K_M 27.4 110.6 13.2s 8.7 GiB 100%
29 qwen3:14b 14.8B Q4_K_M 26.8 108.6 13.3s 8.9 GiB 100%
30 qwen2.5:7b 7.6B Q4_K_M 55.0² 147.9² —² 4.4 GiB 20%²
31 qwen3.5-27b-iq2m 26.9B IQ2_M 11.0 54.2 17.6s 13.4 GiB 0%⁷

★ = production model. ² = intermittent loading bug (72% failure rate). ⁶ = think tokens leak into response. ⁷ = all quality tasks timed out.

All 33 models run at 100% GPU offload after GTT tuning (16 GiB). The qwen3.5-35b-a3b-iq2m's (Qwen MoE) 850 MB swap is OS pages pushed to NVMe — not model weights.

Generation speed — all models

Speed vs Quality

Speed vs quality scatter

Bubble size = parameter count. Gold = production models. The "sweet spot" is the upper-right quadrant: fast + high quality. Note: the quality benchmark uses simple tasks where most models score 90%+ — it does not measure reasoning depth or generation nuance where larger models are expected to outperform smaller ones.

VRAM Usage

VRAM usage — all models

All models fit within the 16.5 GiB Vulkan budget. The Qwen MoE fallback (12.3 GiB) leaves ~4 GiB free at 4K context — sufficient for KV cache growth up to 64K filled.


B4. Quality Assessment

5 tasks × 3 runs per model. Scored by Python script (keyword match, JSON parse, regex, exact number). All 32 models tested.

# Model Sum JSON Fact Instr Arith Total %
1 gemma3:4b 3/3 3/3 3/3 3/3 3/3 15/15 100
2 lexi-8b (uncensored) 3/3 3/3 3/3 3/3 3/3 15/15 100
3 qwen3-abl-nothink:8b 3/3 3/3 3/3 3/3 3/3 15/15 100
4 qwen3-abliterated:8b 3/3 3/3 3/3 3/3 3/3 15/15 100
5 qwen3:8b 3/3 3/3 3/3 3/3 3/3 15/15 100
6 qwen3:8b-nothink 3/3 3/3 3/3 3/3 3/3 15/15 100
7 qwen3:8b-q8_0 3/3 3/3 3/3 3/3 3/3 15/15 100
8 gemma2:9b 3/3 3/3 3/3 3/3 3/3 15/15 100
9 ★ qwen3.5:9b 3/3 3/3 3/3 3/3 3/3 15/15 100
10 ★ gemma4-26b-q3 3/3 3/3 3/3 3/3 3/3 15/15 100
11 gemma3:12b 3/3 3/3 3/3 3/3 3/3 15/15 100
11 phi4:14b 3/3 3/3 3/3 3/3 3/3 15/15 100
12 huihui_ai/qwen3-abliterated:14b 3/3 3/3 3/3 3/3 3/3 15/15 100
13 qwen3-14b-abl-nothink (same model, alt tag) 3/3 3/3 3/3 3/3 3/3 15/15 100
14 qwen3-14b-16k 3/3 3/3 3/3 3/3 3/3 15/15 100
15 qwen3:14b 3/3 3/3 3/3 3/3 3/3 15/15 100
16 deepseek-r1:14b 3/3 3/3 3/3 3/3 3/3 15/15 100
17 ★ MoE 35B-A3B 3/3 3/3 3/3 2/3 3/3 14/15 93
18 phi4-mini 3/3 3/3 3/3 2/3 3/3 14/15 93
19 llama3.2:3b 3/3 3/3 3/3 2/3 3/3 14/15 93
20 llama3.1:8b 3/3 3/3 3/3 2/3 3/3 14/15 93
21 glm4:9b 2/3 3/3 3/3 3/3 3/3 14/15 93
22 Qwen3-Coder-30B-A3B 1/3 3/3 3/3 3/3 3/3 13/15 87
23 seed-coder-abliterate:8b 3/3 3/3 3/3 3/3 1/3 13/15 87
24 granite3.3:8b 3/3 3/3 3/3 3/3 0/3 12/15 80
25 mistral-nemo:12b 3/3 3/3 3/3 3/3 0/3 12/15 80
26 qwen2.5:3b 3/3 0/3 3/3 2/3 3/3 11/15 73
27 deepseek-r1:8b 3/3 3/3 3/3 2/3 0/3 11/15 73
28 qwen2.5-coder:7b 0/3 0/3 3/3 0/3 3/3 6/15 40
29 qwen3:4b ⁶ 1/3 0/3 3/3 0/3 1/3 5/15 33
30 Qwen3-30B-A3B (Q2_K) ⁶ 0/3 0/3 3/3 0/3 1/3 4/15 27
31 qwen2.5:7b² 0/3 0/3 3/3 0/3 0/3 3/15 20
32 qwen3.5-27b-iq2m⁷ 0/3 0/3 0/3 0/3 0/3 0/15 0

⁶ Think tokens leak into visible response — scores reflect token budget exhaustion, not true capability. ² qwen2.5:7b has a 72% intermittent load failure; outputs gibberish when loaded. Only fact recall passes (keyword "W" found). ⁷ qwen3.5-27b-iq2m: all 15 tasks timed out at 180s or model failed to load entirely.

Quality tier summary:

  • 100% — 17 models (all 14B, all Qwen3 8B, gemma3:4b/12b, gemma2:9b, lexi-8b, qwen3.5:9b, gemma4-26b-q3)
  • 93% — 5 models (35B MoE, phi4-mini, llama3.2:3b, llama3.1:8b, glm4:9b) — each missed one task
  • 87% — 2 models (Qwen3-Coder-30B-A3B missed summarize; seed-coder-abliterate:8b missed arithmetic)
  • 80% — 2 models (granite3.3:8b, mistral-nemo:12b fail arithmetic)
  • 73% — 2 models (qwen2.5:3b JSON fails, deepseek-r1:8b arithmetic fails)
  • ≤40% — 3 models (think-leak or task-specialized)
  • 20% — 1 model (qwen2.5:7b — intermittent loading bug, gibberish output)
  • 0% — 1 model (qwen3.5-27b-iq2m — all tasks timed out or load failure)

Quality scores — all models

Quality by task category

Arithmetic (17×23) is the hardest task — 8 models score below 3/3 (5 at 0/3, 3 at 1/3). Fact recall is the easiest — every testable model passes. Failure patterns are model-specific, not hardware-related.


B5. Context Scaling — Filled Context

Methodology: 80% real-token fill with prompt_eval_count truncation detection. Testing depth varied: 6 Phase 3 core models (2 runs per config), 2 gap-closer models (1–2 runs), 22 sweep models (1 run, validated by Phase 2 CV <1.5%), 2 models from the extended benchmark (same methodology, 4K–128K range).

Coverage: 31 of 33 models completed filled-context testing; 25 reach the 64K ceiling, 1 reaches 48K (gemma4-26b-q3). Two models could not produce results: qwen2.5-coder:7b (pec=0 at all fills) and qwen3.5-27b-iq2m (warmup failure).

B5.1 Production models — speed vs filled context

Measurement history: Three independent measurement rounds confirm 4K–32K results within ±1 tok/s. At 64K, a significant regression appeared between the initial round and a later retest — documented below as an open investigation item.

Model 4K 16K 32K 48K 64K (initial) 64K (after uptime) Degradation (4K→32K)
MoE 35B-A3B 35.6 31.9 28.5 22.9 0.7 ⚠️ −20%
qwen3.5:9b 31.1 29.4 27.0 23.4 −13%
🏆 gemma4-26b-q3 35.2 31.1 27.7 25.0 TIMEOUT −29% (4K→48K)
phi4-mini 74.3 48.7 33.2 20.3 −55%
qwen3:8b 39.4 30.3 22.5 15.4 −43%
qwen3:14b 25.2 20.7 16.7 12.0 −34%
gemma3:4b 74.8 72.3 70.0 65.1 −6% 🏆
gemma3:12b 28.4 27.5 26.3 24.2 −7%

⚠️ 64K regression under investigation: The MoE ran 64K filled context at 22.9 tok/s in the initial round (Phase 3, isolated cold run, 302s TTFT, 38K prompt tokens, 328 MB swap delta). In a later isolated retest (same script methodology), 64K produced only 0.7 tok/s (596s TTFT, 30K prompt tokens). The system had 19h uptime at retest time. 4K–32K reproduced within ±1 tok/s across all three rounds. Likely cause: UMA memory fragmentation after extended uptime — confirmed by §4.10 fresh-reboot test where Ollama achieved 28.7 tok/s at 64K. The initial value (22.9 tok/s) is preserved as the known-good reference.

4K–32K values: median of 3 measurement rounds (initial, batch sweep, isolated retest). gemma3:4b shows the least degradation overall. All Qwen MoE 64K data shows 100% GPU offload (41/41 layers) in both tests.

Context degradation — production models

B5.2 Full filled-context sweep

Three measurement rounds (R1: initial, R2: batch sweep, R3: isolated retest). 4K–32K values are consistent within ±1 tok/s across rounds. R2 (19 models sequential, 600s timeout) caused some models to fail due to VRAM contention between models — these are marked. R3 ran Qwen MoE and deepseek-r1:14b in isolation to verify.

Model 4K 16K 32K 64K Ceiling Rounds
llama3.2:3b 87.8 56.3 38.3 23.3 64K R1+R2
gemma3:4b 74.8 72.3 70.0 65.1 64K R1+R2
qwen3:4b 61.4 40.1 28.5 17.6 64K R1+R2
Qwen3-30B-A3B (Q2_K) 53.6 40.1 30.0 20.4 → —⁵ 64K32K R1+R2
Qwen3-Coder-30B-A3B 58.4 42.8 32.6 22.9 64K R1
llama3.1:8b 47.0 35.6 26.5 17.6 64K R1+R2
seed-coder-abliterate:8b 46.1 34.7 25.7 17.9 64K R1+R2
lexi-8b 45.3 33.4 25.0 16.4 64K R1
qwen3-abl-nothink:8b 41.4 30.6 22.7 14.2 64K R1
qwen3-abliterated:8b 40.9 30.4 22.7 14.8 64K R1
granite3.3:8b 40.2 27.8 19.8 12.2 64K R1
deepseek-r1:8b 39.5 29.7 22.3 14.8 64K R1
qwen3:8b 39.4 30.3 22.5 15.4 64K R1+R2
qwen3:8b-nothink 39.7 28.6 21.2 14.8 64K R1
glm4:9b 37.0 23.3 15.5 9.2 64K R1+R2
gemma3:12b 28.4 27.5 26.3 24.2 64K R1+R2
mistral-nemo:12b 31.8 24.7 19.1 13.1 64K R1+R2
qwen3-abliterated:14b 25.9 20.8 16.5 11.7 64K R1
qwen3-14b-16k 25.9 20.8 16.6 11.7 64K R1
qwen3:8b-q8_0 29.3 23.6 18.7 13.1 64K R1
qwen3:14b 25.2 20.7 16.7 11.0 → —⁵ 64K32K R1+R2
phi4:14b 26.0 19.5 ✂️ 16K 16K R1+R2
gemma2:9b 29.6 17.1 ✂️ 8K 8K³ R1+R2
deepseek-r1:14b 26.5 19.7 14.8 ⚠️ 2.3 → 0.1⁶ 32K R1+R3
★ MoE 35B-A3B 35.6 31.9 28.5 22.9 → 0.7⁶ 64K → **??**⁶ R1+R2+R3
★ qwen3.5:9b 31.1 29.4 27.0 23.4 64K R2
🏆 gemma4-26b-q3 35.2 31.1 27.7 TIMEOUT 48K R4

🏆 gemma4-26b-q3 tested in R4 (dedicated run, 6 context sizes 4K–48K, SCP payload delivery). 40K=26.4, 48K=25.0 tok/s. 65K allocation times out.

✂️ = silently truncated to native limit (prompt_eval_count flat across context sizes). ³ = gemma2:9b truncates at 8K native. ⁵ = R1 measured 64K OK; R2 (sequential batch) failed — likely VRAM contention between models. Individual values shown as "old → new".

64K regression (Qwen MoE + deepseek): Both models showed 10–30× speed reduction at 64K between the initial round and a later retest (see table below). 4K–32K data is consistent across rounds (±1 tok/s). The retest was isolated (full unload + 15s sleep between each context size), ruling out VRAM contention. The system had 19h uptime at retest time. Likely UMA memory fragmentation — confirmed by §4.10 where fresh-reboot Ollama achieved 28.7 tok/s at 64K.

64K regression detail — Qwen MoE and deepseek-r1:14b:

Model Metric Initial (R1) Retest (R3) Change
MoE 35B-A3B gen tok/s 22.9 0.7 −97%
MoE 35B-A3B prefill tok/s 135 58 −57%
MoE 35B-A3B TTFT (s) 302 596 +97%
MoE 35B-A3B prompt tokens 38,348 30,614 −20%
MoE 35B-A3B swap delta +328 MB
deepseek-r1:14b gen tok/s 2.3 0.1 −96%
deepseek-r1:14b TTFT (s) 13,084 3.6 hours
deepseek-r1:14b prompt tokens 30,609

B5.3 Context ceiling grid

Context ceiling heatmap

Ceiling Models
64K 22 models (69%) — including Coder-30B, Q2_K, qwen3:14b (initial round data)
64K (degraded) 2 models (MoE, deepseek-r1:14b) — 64K works but with severe speed regression in later retest (likely UMA fragmentation)⁶
48K 1 model (★ gemma4-26b-q3) — verified 4K–48K filled, 25.0 tok/s at 48K, no truncation. 65K times out
32K 3 models (qwen2.5:3b¹, qwen2.5:7b, ★ MoE practical ceiling)
16K 1 model (phi4:14b³)
8K 1 model (gemma2:9b³)
Broken 2 models (qwen2.5-coder:7b pec=0, qwen3.5-27b too large)

¹ qwen2.5:3b ceiling from extended benchmark. ³ phi4:14b and gemma2:9b silently truncate — actual filled ceilings are 16K and 8K respectively. ⁶ MoE achieved 22.9 tok/s @64K initially but only 0.7 tok/s on retest after extended uptime — likely UMA fragmentation (see B5.2). The practical ceiling for time-critical workloads is 32K (28.5 tok/s, stable across all rounds). ⁹ gemma4-26b-q3 verified 4K–48K filled (25.0 tok/s at 48K, TTFT 190.9s); 65K allocation times out.

B5.4 Prefill rate scaling

Model 4K 16K 32K 48K 64K (initial) 64K (after uptime)
★ MoE 35B-A3B 239 215 182 135 58 ⚠️
🏆 gemma4-26b-q3 270 219 177 148 TIMEOUT
★ qwen3.5:9b 227 206 182 145
phi4-mini 452 289 194 117
qwen3:8b 225 158 111 71
qwen3:14b 125 93 68
deepseek-r1:14b 121 91 69 2.4 ⚠️

Prefill rate (tok/s) at 80% filled context. gemma4-26b-q3 achieves 270 tok/s at 4K (highest among production models), degrading gracefully to 148 tok/s at 48K. Both MoE 35B-A3B and qwen3.5:9b converge to ~230 tok/s prefill at 4K. At 64K, the MoE shows a 2.3× prefill regression between the initial round (135 tok/s) and retest (58 tok/s) — same system, same Ollama version. deepseek-r1:14b prefill collapsed to 2.4 tok/s at 64K on retest (from ~40 tok/s estimated initially). See ⁶ regression note in B5.2.

B5.5 TTFT at filled context (Phase 3 core, run 1)

Model 4K 16K 32K 48K 64K
★ MoE 35B-A3B 26s 63s 126s 302s
🏆 gemma4-26b-q3 11s 43s 107s 191s TIMEOUT
★ qwen3.5:9b 117s¹ 57s 116s 279s
phi4-mini 11s 37s 105s
qwen3:14b 30s 115s 287s

¹ Elevated 4K TTFT includes model load time (model was not loaded prior to this test in the sequence). Four of six Phase 3 core models shown (qwen3:8b and mistral-nemo:12b omitted for brevity).

For interactive chat, the practical ceiling is 16K–32K filled (1–2 min TTFT). Above 32K, TTFT exceeds 2 minutes — acceptable only for batch.


B6. Long-Context Quality

What this tests and why it matters: B5 measures speed at filled context — can the model still generate tokens when the KV cache is full? B6 measures accuracy — can the model still use what's in that context?

Real workloads (code review on a large diff, summarising a log dump, correlating sensor data) require the model to (1) find a specific fact buried in thousands of tokens, (2) link multiple facts that are far apart, and (3) notice when two pieces of information contradict each other. If context quality degrades before context speed does, the extra context window is useless.

B6.1 is the baseline: plant a known fact and ask for it back — pure retrieval. B6.2 raises the bar: the answer requires chaining 3 scattered facts through arithmetic, or spotting a contradiction between two studies separated by thousands of filler tokens. These are the operations that break first when a model's effective attention window is shorter than its advertised context length.

B6.1 Embedded fact retrieval (16K) — 100% pass

Three unique facts embedded at 25%, 50%, 75% positions in 16K filled context:

Model Early (25%) Middle (50%) Late (75%) Total
★ gemma4-26b-q3 2/2 2/2 2/2 6/6
★ MoE 35B-A3B 2/2 2/2 2/2 6/6
★ qwen3.5:9b 2/2 2/2 2/2 6/6
phi4-mini 2/2 2/2 2/2 6/6
Total 24/24 (100%)

B6.2 Multi-hop reasoning & long-range synthesis (16K + 32K)

Four tasks at 16K and 32K filled context (80% fill, 5 diverse text domains). Facts embedded at known positions. Scoring: deterministic string-containment. Two independent runs; full prompts, responses, and scoring saved in benchmarks/results-longctx/.

Four test types:

  • multihop_budget — 3 facts → $4.2M × 60% × 50% = $1.26M
  • multihop_population — 3 facts → 840K × 35% × 20% = 58,800
  • synthesis_contradictions — identify 2 contradicting ocean temperature studies
  • synthesis_timeline — order 3 dated biotech events chronologically

Per-model results (run 1 / run 2):

Model 16K (R1/R2) 32K (R1/R2) Combined
★ gemma4-26b-q3 3/4 / 3/4 2/4 / 3/4 11/16
★ MoE 35B-A3B 3/4 / 2/4 3/4 / 2/4 10/16
qwen3.5:9b 2/4 / 3/4 3/4 / 2/4 10/16
phi4-mini 1/4 / 3/4 2/4 / 2/4 8/16

Per-task breakdown (64 trials: 4 models × 2 contexts × 2 runs):

Task Combined Pattern
multihop_budget 1/16 Near-universal fail — models write "$1,260,000" but check expects "1.26"
multihop_population 7/16 Variable — fact linkage sometimes missed at 32K
synthesis_contradictions 15/16 Strong — contradiction detection reliable across runs
synthesis_timeline 16/16 Universal pass — temporal ordering easiest task

Long-context quality heatmap

Key insight: Synthesis tasks are substantially more reliable than multi-hop arithmetic (31/32 vs 8/32 across both runs). gemma4-26b-q3 leads at 11/16 (perfect on all synthesis tasks), followed by MoE 35B and qwen3.5:9b tied at 10/16, phi4-mini at 8/16. Results vary between runs (LLM sampling variance), but task-level patterns are consistent.


B7. Cold-Start Timing

Model Run 1 Run 2 Run 3 Median Load time
★ gemma4-26b-q3 19.1s 17.7s 17.6s 17.7s 16.7s (~690 MB/s)
★ MoE 35B-A3B 18.0s 18.0s 17.5s 18.0s 16.2s (~660 MB/s)
★ qwen3.5:9b 11.9s 6.8s 7.0s 7.0s 5.6s (~1.1 GB/s)

Run 1 of qwen3.5:9b is ~70% slower than Run 2/3 because the GGUF file was not yet in the Linux page cache. Subsequent loads read from cached RAM pages. The MoE shows no gap because its GGUF was already cached from prior tests. Page cache was not dropped between runs (see B1.1).

With OLLAMA_KEEP_ALIVE=30m, cold start occurs only after 30 minutes idle. Warm TTFT: 0.3–1.7s.

Signal chat latency profile (gemma4-26b-q3, primary):

State TTFT Gen speed
Warm, short prompt (<1K) 0.3–1.7s 39.0 tok/s
Warm, medium prompt (~3K) ~15s 39.0 tok/s
Cold start (after 30 min) ~17.7s 39.0 tok/s
16K filled context ~51s 32 tok/s
32K filled context ~125s 28 tok/s
48K filled context (ceiling) ~191s
64K filled context (MoE fallback) ~302s 22.9 tok/s

B8. Quantization Impact

Model quant Gen tok/s Prefill VRAM @4K Swap Notes
qwen3:8b Q4_K_M 43.1 192.7 5.1 GiB 510 MB Standard
qwen3:8b Q8_0 31.2 237.2 8.5 GiB 1047 MB 28% slower, 67% more VRAM

Q4_K_M + Q4_0 KV cache is the sweet spot for this hardware — the 28% speed loss from Q8_0 is not worth the marginal precision gain for production tasks.

Why swap increases: The BC-250 has only ~14 GiB usable system RAM (kernel and firmware reserve ~2 GiB of the 16 GiB GDDR6). On this UMA system, GPU allocations come from the same physical pool. Even Q4_K_M (5.1 GiB model) shows 510 MB swap — the OS swaps background processes and page cache to make room. At Q8_0 (8.5 GiB), the larger model leaves less headroom for everything else, doubling swap pressure.


B9. Image Generation Benchmarks

sd.cpp, Vulkan GFX1013. Ollama stopped during image gen tests. All at 512×512 with same prompt and seed 42.

B9.1 Head-to-head comparison

Model Time @512² Steps VRAM Encoder
SD-Turbo 27s 4 2 GB built-in
FLUX.2-klein-4B 37s 4 6 GB Qwen3-4B
Chroma flash 67s 4 8.4 GB T5-XXL
FLUX.2-klein-9B 67s 4 11.8 GB Qwen3-8B
SD3.5-medium 102s 28 6 GB CLIP+T5
FLUX.1-schnell 107s 4 10 GB CLIP+T5
FLUX.1-kontext-dev 132s 20 10 GB CLIP+T5
FLUX.1-dev 167s 20 10 GB CLIP+T5

★ = production default (highest tested quality at practical speed)

Image generation comparison

B9.2 Resolution scaling — FLUX.2-klein

FLUX.2-klein-9B (production):

Resolution Steps Time s/step
512×512 4 67s 16.8
768×768 4 97s 24.2
1024×1024 4 147s 36.8
512×512 8 ❌ OOM

FLUX.2-klein-4B (fast alternative):

Resolution Steps Time s/step
512×512 4 37s 9.2
768×768 4 52s 13.0
1024×1024 4 82s 20.5
512×512 8 42s 5.2
1024×1024 8 122s 15.2

B9.3 Video & Upscaling

Task Model Details Time
Video WAN 2.1 T2V 1.3B Q4_0 480×320, 17 frames, 50 steps ~38 min
Upscale 4× ESRGAN (tile 192) 512² → 2048² 22s
Upscale 16× ESRGAN (128×2 passes) 512² → 8192² (67 MP) 4:50

B10. Model Recommendations

This is the single authoritative recommendation table for the BC-250. Every number below is sourced from the benchmark appendix; provenance footnotes follow.

Use Case Model Gen tok/s Filled Ctx Quality Why
🏆 Primary chat gemma4-26b-q3 39.0 48K verified 100% Largest 100% GPU-offloaded model (13.5 GiB); 1238 tok/s prefill; 128-expert MoE with ~3.8B active. 35→25 tok/s (4K→48K), no truncation. Also powers all 25+ automated scripts
🏆 Fallback (>40K) qwen3.5-35b-a3b-iq2m 37.5 32K practical 93% Largest knowledge capacity that fits 16 GB UMA; fast due to MoE (only 3B active). 64K works but with severe speed regression (see B5.2 note ⁶)
🏆 Vision / long ctx qwen3.5:9b 31.7 64K 100% Multimodal, most resilient context scaling (−13% at 32K, −25% at 64K)
Fast + lightweight phi4-mini 86.1 64K 93% Fastest model passing basic quality checks; only 2.5 GiB VRAM
Reasoning deepseek-r1:14b 28.7 32K 100% Perfect quality score; chain-of-thought
Speed-critical llama3.2:3b 102.2 64K 93% Fastest tested; good enough for simple tasks
Image gen FLUX.2-klein-9B 67s @512² ★ preferred 4-step, Qwen3-8B encoder; best visual result in side-by-side tests (B9)

Gen tok/s = Phase 2 median at 4K context where available (B3); gemma4-26b-q3 from §4.9 Ollama comparison (not in Phase 2); Phase 1 single-run for deepseek-r1:14b (B3). Filled Ctx = verified ceiling with 80% real-token fill (§4.5, B5.3). MoE 32K practical ceiling — 64K achieved 22.9 tok/s initially but only 0.7 tok/s on retest after extended uptime (likely UMA fragmentation, see B5.2 note ⁶). phi4-mini 64K verified via extended benchmark (§4.5: 20.3 tok/s). llama3.2:3b 64K verified via full sweep (B5.2: 23.3 tok/s). Quality = 5 tasks × 3 runs, 32 models (B4). qwen3.5:9b −25% from B5.1 context degradation analysis.

Why MoE likely wins on this hardware (hypothesis): The BC-250 has no tensor cores / matrix accelerators — all compute runs through scalar ALUs on 24 shader CUs. A 35B MoE with 3B active parameters does fewer multiplications per token than a 14B dense model, despite storing more knowledge. Result: 37.5 tok/s (35B MoE) vs 26.8 tok/s (dense 14B) with 93% vs 100% quality. However, this comparison confounds architecture (MoE vs dense), model family (Qwen3.5 vs Qwen3), and quantization (IQ2_M vs Q4_K_M). An isolated test would require same-family, same-quant MoE vs dense models — none were available at time of testing.


PART V — Reference

9. Repository Structure

▸ Full tree
bc250/
├── README.md                       ← you are here
├── netscan/                        → /opt/netscan/
│   ├── queue-runner.py             # v7 — continuous loop + Signal chat (330 jobs)
│   ├── career-scan.py              # Two-phase career scanner
│   ├── career-think.py             # Per-company career analysis
│   ├── salary-tracker.py           # Salary intelligence
│   ├── company-intel.py            # Company deep-dive
│   ├── company-think.py            # Per-entity company analysis
│   ├── patent-watch.py             # Patent monitor
│   ├── event-scout.py              # Event tracker
│   ├── city-watch.py               # SkyscraperCity local construction monitor
│   ├── leak-monitor.py             # CTI: 11 OSINT sources + Ahmia dark web
│   ├── ha-journal.py               # Home Assistant journal
│   ├── ha-correlate.py             # HA cross-sensor correlation
│   ├── ha-observe.py               # Quick HA queries
│   ├── csi-sensor-watch.py         # CSI camera sensor patent/news
│   ├── csi-think.py                # CSI camera domain analysis
│   ├── radio-scan.py               # Radio hobbyist forum tracker
│   ├── market-think.py             # Market sector analysis
│   ├── life-think.py               # Cross-domain life advisor
│   ├── system-think.py             # GPU/security/health system intelligence
│   ├── career-digest.py            # Weekly career digest → Signal (Sunday)
│   ├── daily-summary.py            # End-of-cycle Signal summary
│   ├── frost-guard.py              # Frost/freeze risk alerter
│   ├── repo-think.py               # LLM analysis of repo changes
│   ├── academic-watch.py           # Academic publication monitor
│   ├── news-watch.py               # Tech news aggregation + RSS feeds
│   ├── book-watch.py               # Book/publication tracker
│   ├── weather-watch.py            # Weather forecast + HA sensor correlation
│   ├── car-tracker.py              # GPS car tracker (SinoTrack API, trip/stop detection)
│   ├── bc250-extended-health.py    # System health assessment (services, data freshness, LLM quality)
│   ├── llm_sanitize.py             # LLM output sanitizer (thinking tags, JSON repair)
│   ├── generate-html.py            # Dashboard builder (6900+ lines, 29 main + 101 host pages)
│   ├── gpu-monitor.py              # GPU data collector
│   ├── idle-think.sh               # Research brain (8 task types)
│   ├── repo-watch.sh               # Upstream repo monitor
│   ├── lore-digest.sh              # Mailing list digests (8 feeds)
│   ├── bc250-health-check.sh       # Quick health check (systemd timer, triggers extended health)
│   ├── gpu-monitor.sh              # Per-minute GPU sampler
│   ├── scan.sh / enumerate.sh      # Network scanning
│   ├── vulnscan.sh                 # Weekly vulnerability scan
│   ├── presence.sh                 # Phone presence detection
│   ├── syslog.sh                   # System health logger
│   ├── watchdog.py                 # Network security checker
│   ├── report.sh                   # Morning report rebuild
│   ├── profile.json                # Public interests + Signal config
│   ├── profile-private.json        # Career context (gitignored)
│   ├── watchlist.json              # Auto-evolving interest tracker
│   ├── digest-feeds.json           # Feed URLs (8 mailing lists)
│   ├── repo-feeds.json             # Repository endpoints
│   └── sensor-watchlist.json       # CSI sensor tracking list
├── systemd/
│   ├── queue-runner.service        # v7 — continuous loop + Signal chat
│   ├── queue-runner-nightly.service # Nightly batch trigger
│   ├── queue-runner-nightly.timer
│   ├── signal-cli.service          # Standalone JSON-RPC daemon
│   ├── bc250-health.service        # Health check timer
│   ├── bc250-health.timer
│   ├── ollama.service
│   ├── ollama-watchdog.service     # Ollama restart watchdog
│   ├── ollama-watchdog.timer
│   ├── ollama-proxy.service        # LAN proxy for Ollama API
│   └── ollama.service.d/
│       └── override.conf           # Vulkan + memory settings
├── scripts/
│   └── ollama-proxy.py             # Reverse proxy (injects think:false for qwen3)
├── generate-and-send.sh            → /opt/stable-diffusion.cpp/ (legacy EXEC pattern, intercepted by queue-runner)
└── generate-and-send-worker.sh     → legacy async worker (unused in v7, kept for EXEC pattern match)

Deployment

Local → bc250
netscan/* /opt/netscan/
systemd/queue-runner.service /etc/systemd/system/queue-runner.service
systemd/signal-cli.service /etc/systemd/system/signal-cli.service
systemd/ollama.* /etc/systemd/system/ollama.*
generate-and-send*.sh /opt/stable-diffusion.cpp/
# Typical deploy workflow
scp netscan/queue-runner.py bc250:/tmp/
ssh bc250 'sudo cp /tmp/queue-runner.py /opt/netscan/ && sudo systemctl restart queue-runner'

10. Troubleshooting

▸ ROCm initialization appears in Ollama logs

On this deployment, Ollama attempted a ROCm path during startup, failed on GFX1013, and continued with Vulkan. No action is needed unless startup behavior changes on a newer software stack.

▸ Only 7.9 GiB GPU memory instead of 16 GiB

GTT tuning not applied. Check: cat /sys/module/ttm/parameters/pages_limit (should be 4194304). See §3.3.

▸ 14B model loads but inference returns HTTP 500

TTM pages_limit bottleneck. Fix: echo 4194304 | sudo tee /sys/module/ttm/parameters/pages_limit (see §3.3).

▸ Model loads on CPU instead of GPU

Check OLLAMA_VULKAN=1: sudo systemctl show ollama | grep Environment

▸ Context window OOM kills (the biggest gotcha on 16 GB)

Ollama allocates KV cache based on num_ctx. Many models default to 32K–40K context, which on a 14B Q4_K model means 14–16 GB just for the model — leaving nothing for the OS.

Symptoms: Ollama or queue-runner gets OOM-killed, Ollama journal shows 500 errors, dmesg shows oom-kill.

Root cause: The abliterated Qwen3 14B declares num_ctx 40960 → 16 GB total model memory.

Fix: Create a custom model with context baked in:

cat > /tmp/Modelfile.16k << 'EOF'
FROM huihui_ai/qwen3-abliterated:14b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-14b-16k -f /tmp/Modelfile.16k

This drops memory from ~16 GB → ~11.1 GB. Alternatively, set OLLAMA_CONTEXT_LENGTH=65536 in the systemd override (see §3.4) — this is the production mechanism used in v7+.

▸ signal-cli not responding on port 8080

Check the service: systemctl status signal-cli. If it crashed, restart: sudo systemctl restart signal-cli. Verify JSON-RPC:

curl -s http://127.0.0.1:8080/api/v1/rpc \
  -d '{"jsonrpc":"2.0","method":"listAccounts","id":"1"}'
▸ zram competing with model for physical RAM

Fedora defaults to ~8 GB zram. zram compresses pages but stores them in physical RAM — directly competing with the model. On 16 GB systems running large models, disable or limit zram and use NVMe file swap instead:

sudo mkdir -p /etc/systemd/zram-generator.conf.d
echo -e '[zram0]\nzram-size = 2048' | sudo tee /etc/systemd/zram-generator.conf.d/small.conf
▸ Python cron scripts produce no output

Stdout is fully buffered under cron (no TTY). Add at script start:

sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
▸ Signal delivery from signal-cli

Signal JSON-RPC API at http://127.0.0.1:8080/api/v1/rpc:

curl -X POST http://127.0.0.1:8080/api/v1/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"send","params":{
    "account":"+<BOT>","recipient":["+<YOU>"],
    "message":"test"
  },"id":"1"}'

11. Known Limitations

Issue Impact
Shared VRAM In this setup, image gen requires stopping Ollama (single 16 GB UMA pool). Bot offline ~1 min (FLUX.2-klein-4B) or ~2 min (FLUX.2-klein-9B).
MoE context limit With Q4_0 KV, MoE 35B-A3B allocates 256K context. Practical filled ceiling is 32K (28.5 tok/s, stable). 64K filled context showed 22.9 tok/s initially but only 0.7 tok/s on retest after extended uptime — likely UMA fragmentation (B5.2).
Signal latency Messages queue during job execution (typical job 2–15 min). Chat checked between every job.
sd-cli hangs on GFX1013 Vulkan cleanup bug → poll + kill workaround.
Cold start latency 30–60s after Ollama restart (model loading).
Chinese thinking leak Qwen3 occasionally outputs Chinese reasoning. Cosmetic.
FLUX.2-klein-9B 8-step OOM At 8 steps (vs default 4), the 9B model fails — likely compute graph exceeds VRAM. The 4B variant handles 8 steps fine.
Prefill rate degrades with context (dense models) qwen3:14b showed 128 tok/s at 1.3K → 70 tok/s at 10K tokens. MoE primary held ~127 tok/s across prompt sizes in testing.
Gen speed degrades with context fill (dense models) qwen3:14b showed 27 tok/s empty → 13 tok/s at 30K tokens. The MoE degrades too, but less steeply: 35.6 tok/s at 4K filled → 28.5 tok/s at 32K filled (−35%).
Speculative decoding not yet available Ollama 0.18 has no --draft-model. Dual-model loading evicts the draft model. May change in future Ollama versions.
TTS not currently feasible CPU-based TTS (Piper, Coqui) competes with GPU for the same 16 GB UMA pool. No practical Vulkan-accelerated TTS path was identified for this deployment as of early 2026.

12. Software Versions

Pinned versions as of March 2026. All components built/installed on Fedora 43.

Component Version Notes
OS Fedora 43, kernel 6.18.9 Headless, performance governor
Ollama 0.18.0 Vulkan backend, OLLAMA_FLASH_ATTENTION=1
Mesa / RADV 25.3.4 Vulkan 1.4.328, RADV GFX1013
stable-diffusion.cpp master-504 (636d3cb) Built with -DSD_VULKAN=ON. Reverted from master-525 due to FLUX.2-klein tensor naming regression.
whisper.cpp v1.8.3-198 (30c5194c) Built with Vulkan, large-v3-turbo model
signal-cli 0.13.24 Native binary, JSON-RPC at :8080
Qwen3.5-35B-A3B IQ2_M (GGUF, ~11 GB) Primary MoE model, via unsloth
qwen3.5:9b Q4_K_M (GGUF, 6.1 GB) Vision + long context model
FLUX.2-klein-9B Q4_0 (GGUF, 5.3 GB) Image generation, via leejet
ggml-large-v3-turbo 1.6 GB Whisper model for audio transcription
ESRGAN RealESRGAN_x4plus (64 MB) 4× image upscaling
Python 3.13 queue-runner, netscan scripts

13. References

Hardware & Drivers

Resource URL
AMD BC-250 community docs (BIOS, setup) https://elektricm.github.io/amd-bc250-docs/
LLVM AMDGPU processor table (GFX1013) https://llvm.org/docs/AMDGPUUsage.html#processors
Mesa RADV Vulkan driver https://docs.mesa3d.org/drivers/radv.html
Linux TTM memory manager https://docs.kernel.org/gpu/drm-mm.html

LLM Inference

Resource URL
Ollama — local LLM runtime https://github.com/ollama/ollama
Qwen3.5 model family (Alibaba) https://huggingface.co/Qwen
Qwen3.5-35B-A3B GGUF (unsloth) https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
Qwen3.5-9B (Ollama) https://ollama.com/library/qwen3.5
GGUF quantization format (llama.cpp) https://github.com/ggml-org/llama.cpp

Image & Video Generation

Resource URL
stable-diffusion.cpp (Vulkan) https://github.com/leejet/stable-diffusion.cpp
FLUX.2-klein-9B GGUF https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF
FLUX.2-klein-4B GGUF https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF
FLUX.1-Kontext-dev (image editing) https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev
Chroma (flash distilled) https://huggingface.co/leejet/Chroma-GGUF
WAN 2.1 T2V 1.3B (video generation) https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
Real-ESRGAN (image upscaling) https://github.com/xinntao/Real-ESRGAN

Audio & Speech

Resource URL
whisper.cpp (Vulkan STT) https://github.com/ggml-org/whisper.cpp
Whisper GGML models (large-v3-turbo) https://huggingface.co/ggerganov/whisper.cpp

Messaging & Integration

Resource URL
signal-cli (Signal messenger CLI) https://github.com/AsamK/signal-cli
Signal Protocol https://signal.org/docs/

Appendix A — OpenClaw Archive

▸ Historical: OpenClaw gateway configuration (replaced in v7)

OpenClaw v2026.2.26 was used as the Signal ↔ Ollama gateway from project inception through queue-runner v6. It was a Node.js daemon that managed signal-cli as a child process, routed messages to the LLM, and provided an agent framework with tool dispatch.

Why it was replaced:

  • ~700 MB RSS on a 16 GB system (4.4% of total RAM)
  • 15+ second overhead per agent turn (system prompt injection, tool resolution)
  • Unreliable fallback chains caused "fetch failed" timeout cascades
  • Could not run scripts as direct subprocesses — everything went through the LLM agent
  • signal-cli children survived gateway OOM kills, holding port 8080 as orphans
  • 9.6K system prompt that couldn't be reduced below ~4K without breaking tools

What replaced it: See §5 for the current architecture.

A.1 Installation (historical)

sudo dnf install -y nodejs npm
sudo npm install -g openclaw@latest

openclaw onboard \
  --non-interactive --accept-risk --auth-choice skip \
  --install-daemon --skip-channels --skip-skills --skip-ui --skip-health \
  --daemon-runtime node --gateway-bind loopback

A.2 Model configuration (historical)

~/.openclaw/openclaw.json:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:11434",
        "apiKey": "ollama-local",
        "api": "ollama",
        "models": [{
          "id": "qwen3-14b-16k",
          "name": "Qwen 3 14B (16K ctx)",
          "contextWindow": 16384,
          "maxTokens": 8192,
          "reasoning": true
        }]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3-14b-16k",
        "fallbacks": ["ollama/huihui_ai/qwen3-abliterated:14b", "ollama/mistral-nemo:12b"]
      },
      "thinkingDefault": "high",
      "timeoutSeconds": 1800
    }
  }
}

A.3 Tool optimization (historical)

{
  "tools": {
    "profile": "coding",
    "alsoAllow": ["message", "group:messaging"],
    "deny": ["browser", "canvas", "nodes", "cron", "gateway"]
  },
  "skills": { "allowBundled": [] }
}

A.4 Agent identity (historical)

Personality lived in workspace markdown files (~/.openclaw/workspace/):

File Purpose Size
SOUL.md Core personality 1.0 KB
IDENTITY.md Name/emoji 550 B
USER.md Human info 1.7 KB
TOOLS.md Tool commands 2.1 KB
AGENTS.md Grounding rules 1.4 KB
WORKFLOW_AUTO.md Cron bypass rules 730 B

A.5 Signal channel (historical)

{
  "channels": {
    "signal": {
      "enabled": true,
      "account": "+<BOT_PHONE>",
      "cliPath": "/usr/local/bin/signal-cli",
      "dmPolicy": "pairing",
      "allowFrom": ["+<YOUR_PHONE>"],
      "sendReadReceipts": true,
      "textChunkLimit": 4000
    }
  }
}

A.6 Service management (historical)

systemctl --user status openclaw-gateway   # status
openclaw logs --follow                     # live logs
openclaw doctor                            # diagnostics
openclaw channels status --probe           # signal health

The gateway service (openclaw-gateway.service) ran as a user-level systemd unit. It has been disabled and masked:

systemctl --user disable --now openclaw-gateway
systemctl --user mask openclaw-gateway

Artur Andrzejczak · andrzejczak.artur@gmail.com · March 2026

Development assisted by Claude Opus 4.6.

Code: AGPL-3.0 · Docs: CC BY-SA 4.0

About

AMD BC-250 (PS5 APU) setup guide — Ollama + Vulkan inference, poor man's AI assistant via Signal, stable-diffusion.cpp image generation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages