go-pherence

A portable Go inference toolkit for local and embedded AI. go-pherence runs transformer decoders, encoders, speech models, and experimental accelerator paths from a single Go codebase, with NVIDIA GPU execution, SIMD CPU kernels, MLX/GPTQ/BF16/F16/F32 weight support, Vulkan scaffolding, and embedded SoC backends. It builds as a single static Go binary whenever possible: no Python runtime, no mandatory native SDK, and no CGo dependency in the default hot path.

Current development focus: practical local inference across NVIDIA, CPU/SIMD, and embedded targets; native MTP/speculative decoding for Gemma4 and Qwen3.6; Whisper decoder acceleration; and SpacemiT K3/RISC-V IME2 experiments. See docs/mtp-speculative.md, docs/qwen36-mtp.md, docs/spacemit-ime2.md, and docs/backend-stack.md.

Why

go-pherence is intended to be a small, inspectable inference stack for running real models on the hardware people actually have: NVIDIA cards with tight VRAM budgets, Intel/AMD laptops, ARM SBCs, and emerging embedded accelerators. It treats model formats such as MLX affine quantization as inputs rather than as the product identity, and keeps backend ownership explicit so optimized paths can coexist with checked scalar fallbacks.

Quick start

# Download a small quantized Qwen model
mkdir -p models/qwen3-0.6b
for f in config.json model.safetensors tokenizer.json; do
  curl -L "https://huggingface.co/mlx-community/Qwen3-0.6B-4bit/resolve/main/$f" \
    -o "models/qwen3-0.6b/$f"
done

# CPU path: AVX2/NEON SIMD with scalar fallbacks
go run ./cmd/llm/llmgen -model models/qwen3-0.6b -tokens 50 -prompt "The meaning of life is"

# NVIDIA path: runtime-loaded PTX, zero CGo
go run ./cmd/llm/llmgen -gpu -model models/qwen3-0.6b -tokens 50 -prompt "The meaning of life is"

Current highlights

Backends: NVIDIA PTX runtime, AVX2/NEON SIMD CPU runtime, Vulkan scaffolding, and embedded accelerator experiments such as SpacemiT K3 IME2.
Weight formats: MLX affine 4-bit, GPTQ/Q4, BF16, F16, F32, and model-specific packed paths where useful.
Architectures: LLaMA-family, Qwen2/3/Qwen3Next, Qwen3 MoE, Gemma3/Gemma4, BERT/GTE encoders, Whisper large-v3 translated VTT pipeline, MiniCPM-V/O metadata/prompt/tensor-readiness scaffolding, and experimental 3D/vision runtime scaffolding for Hunyuan3D.
Hybrid placement: --gpu-layers N, compact LM-head placement, reusable GPU caches, and planner-driven windowing for models larger than available VRAM.
Embedded scenarios: low-allocation CPU execution, quantized kernels, RISC-V IME2 INT8 matmul work, and static-binary deployment goals.
Speculative/MTP work: Gemma4 assistant loader, packed 4-bit assistant execution, real prompt activation/KV smoke, and Qwen3.6 native-MTP diagnostics.
Validation: checked runtime APIs, import-boundary tests, malformed-input guards, and hardware-gated smoke tests.

Recommended MTP development target

The fastest local MTP target is the Gemma4 E4B pair:

models/gemma4-e4b-it-4bit
models/gemma4-e4b-mtp-drafter

The pair fits entirely on the local RTX 3060 (42/42 layers resident, compact quantized LM head resident, about 5 GiB of VRAM free). A 207-token real-prompt MTP smoke run prefills in about 9.25 s, and one assistant drafter step takes about 0.10 s.

Example:

GOTMPDIR=$PWD/.gotmp go run ./cmd/llm/llmgen \
  -gpu -gpu-layers 0 \
  -model models/gemma4-e4b-it-4bit \
  -mtp-drafter models/gemma4-e4b-mtp-drafter \
  -mtp-smoke -mtp-real-prompt \
  -prompt "Hi"

-mtp-smoke is a runtime smoke path; it does not yet perform full speculative generation.

Common commands

# One-shot generation
go run ./cmd/llm/llmgen -model models/qwen3-0.6b -gpu -tokens 50 -prompt "Hello"

# Interactive chat
go run ./cmd/llm/llmchat -model models/gemma4-e2b-it-4bit -gpu -n 256

# OpenAI-compatible server
go run ./cmd/llm/llmserver -model models/gemma4-e2b-it-4bit -gpu -listen :8080

# Stock-weight speculative benchmark scaffold
go run ./cmd/llm/specbench -model models/smollm2-135m -prompt-file prompts.txt -tokens 16 -repeat 3

# Large-v3 translated WebVTT from audio (GPU-assisted, resumable)
go run ./cmd/audio/diarize-vtt -input meeting.m4a -output meeting.vtt -language es

# Gemma4 QAT+MTP default parity + GGUF quant oracle gate
make gemma4-mtp-parity

# Strict Gemma4 QAT+MTP selected-logit parity, with an exported llama.cpp fixture.
# This is intentionally red until selected verifier logits match llama.cpp 1:1.
GO_PHERENCE_GEMMA4_MTP_LLAMA_CPP_FIXTURE=tmp/gemma4-mtp-llamacpp-fixture.json \
  make gemma4-mtp-strict-parity

# Native pure Go/SIMD GGUF REAP/TurboQuant validation and benchmark
make gguf-inspect-qwen36-reap
make gguf-smoke-qwen36-reap
make gguf-validate-qwen36-reap
make gguf-bench-qwen36-reap
make gguf-check-qwen36-reap  # validation + benchmark
make gguf-ci-qwen36-reap     # focused build smoke + check

# DiffusionGemma block-diffusion scaffold/status/sparse-native workflow
make diffusiongemma-download-metadata
make diffusiongemma-download-plan-report # show 11-shard / 48.10 GiB payload and free-space plan
make diffusiongemma-inspect
make diffusiongemma-status-refresh
make diffusiongemma-ci-no-weights # safe aggregate validation without safetensor shards
make diffusiongemma-check-sparse-text DIFFUSIONGEMMA_MODEL=models/diffusiongemma-26B-A4B-it
make diffusiongemma-run-sparse-text \
  DIFFUSIONGEMMA_MODEL=models/diffusiongemma-26B-A4B-it \
  DIFFUSIONGEMMA_PROMPT='hi' \
  DIFFUSIONGEMMA_MAX_NEW=8 \
  DIFFUSIONGEMMA_CANVAS=8 \
  DIFFUSIONGEMMA_DENOISE_STEPS=2 \
  DIFFUSIONGEMMA_RUN_RESIDENCY_BUDGET_GIB=16 \
  DIFFUSIONGEMMA_LM_HEAD_TOP_K=8
make diffusiongemma-ci-sparse-text-published DIFFUSIONGEMMA_MODEL=models/diffusiongemma-26B-A4B-it

The Qwen3.6 REAP GGUF CI target exercises the native go-pherence path end-to-end: GGUF inspect, one-token generation smoke, synthetic TurboQuant cache smoke, benchmark, required SIMD rotation readiness, static/runtime/cache/benchmark KV+scratch byte assertions, and aggregate compressed-cache counters. It does not depend on llama.cpp for runtime execution.

The DiffusionGemma targets now cover both the safe no-weight scaffold path and a full-checkpoint native sparse text path. make diffusiongemma-ci-no-weights validates metadata, tokenizer/chat scaffolds, mock denoising, status JSON, and reference-helper dry-runs without safetensor shards. With the 11-shard checkpoint downloaded, make diffusiongemma-check-sparse-text gates the validated native sparse text stack (text_sparse=true, sparse_topk_lm=true), and make diffusiongemma-run-sparse-text accepts arbitrary text prompts through the full 30-layer CPU/SIMD text stack plus sparse top-k LM head. Published 256-token canvas one/two-step sparse smokes are validated by make diffusiongemma-ci-sparse-text-published. Reference-complete DiffusionGemma remains gated by parity fixtures and full processor/vision integration.

See docs/commands.md for detailed command usage, GGUF REAP/TurboQuant validation, MTP smoke commands, Qwen3.6 native-MTP triage commands, Whisper VTT usage, and benchmark harnesses. See docs/whisper-diarize-vtt.md for the current Whisper implementation status and limitations.

Documentation map

Start here:

docs/README.md — full documentation index.
docs/supported-models.md — supported architectures, formats, and performance snapshot.
docs/commands.md — CLI usage and smoke/benchmark commands.
docs/backend-stack.md — NVIDIA, Vulkan, SIMD, BF16, and package ownership summary.
docs/mtp-speculative.md — Gemma4/Qwen3.6 MTP implementation notes.
docs/whisper-diarize-vtt.md — current Whisper translated VTT pipeline and performance notes.
docs/gemma4-31b-runbook.md — Gemma4 E4B/31B local run strategy and smoke results.
docs/qwen36-mtp.md — Qwen3.6 native-MTP checkpoint findings.
docs/ideogram4-support.md — Ideogram 4 FP8 native CPU/SIMD image-generation runtime, validation status, and current GPU limitation.
docs/minicpmv-support.md — OpenBMB MiniCPM-V/O config/processor/tokenizer/generation metadata, image/audio prompt planning, tensor readiness, capabilities, and remaining tensor execution work.
docs/minicpmv-runtime-roadmap.md — ordered path from MiniCPM-V/O scaffold coverage to full text/vision/resampler/audio generation.
docs/diffusiongemma-support.md — DiffusionGemma block-diffusion metadata, sparse native text path, references, controls, and implementation plan.
docs/diffusiongemma-status.md — current DiffusionGemma scaffold/sparse-native status, operation coverage, CI targets, and remaining reference-complete gaps.
docs/validation-gates.md — standard validation gates.
docs/validation-hardening.md — malformed-input and boundary-hardening summary.

SpacemiT K3 / RISC-V IME2 acceleration

go-pherence includes a pure Go backend for the SpacemiT K3 SoC (Milk-V Jupiter 2) that uses IME2 hardware for accelerated INT8 matrix multiplication.

Performance

Benchmark	Result
Raw vmadot (4×8×4 MAC)	9.3 ns / 128M ops/sec
Pre-packed GEMM (1024³, 8 threads)	142 GOPS
Single matmul (2048×1024)	326 µs
End-to-end decode (Qwen3-0.6B)	14 tok/s (INT8 LM head)

Key findings

IME2 spec: spacemit-com/riscv-ime-extension-spec
GCC flag: -march=rv64gcv_xsmtvdotii
X100 cores are CPU 0-7; A100 efficiency cores are 8-15
TCM: mmap /dev/tcm gives 3MB SRAM
Q4_K repack is identity (no tile format change)
Full details: docs/spacemit-ime2.md

License

MIT
