Releases: go-webgpu/goffi
v0.5.0 — Windows ARM64 + FreeBSD support
What's New
Two new platforms -- goffi now supports 7 targets:
Windows ARM64 (Snapdragon X)
- Extended AAPCS64 ARM64 implementation to Windows via build tag changes
- Zero new assembly -- Windows ARM64 ABI is identical to Unix ARM64
- `runtime.cgocall` works on Windows without fakecgo
- Tested on Samsung Galaxy Book 4 Edge (Snapdragon X Elite) by @SideFx
FreeBSD amd64
- Added `libc.so.7` dynamic loading support
- System V ABI identical to Linux -- same assembly code
- Requires `-gcflags="github.com/go-webgpu/goffi/internal/fakecgo=-std"` for `CGO_ENABLED=0` builds
CI improvements
- New cross-compilation job validates all 7 platforms compile correctly
Platform Support (7 targets)
| Platform | Arch | ABI | Status |
|---|---|---|---|
| Windows | amd64 | Win64 | Production |
| Windows | arm64 | AAPCS64 | NEW -- tested on Snapdragon X |
| Linux | amd64 | System V | Production |
| Linux | arm64 | AAPCS64 | Production |
| macOS | amd64 | System V | Production |
| macOS | arm64 | AAPCS64 | Production |
| FreeBSD | amd64 | System V | NEW -- cross-compile verified |
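As a rough illustration of what a cross-compilation check over these seven targets could look like, the sketch below prints one build command per row of the table. The `target` type and `buildCommand` helper are hypothetical names introduced here, and the actual CI job may be set up differently; only the FreeBSD `-gcflags` value is taken from the notes above.

```go
package main

import "fmt"

// target is one GOOS/GOARCH pair from the platform table.
type target struct{ goos, goarch string }

// targets mirrors the seven rows of the platform table.
var targets = []target{
	{"windows", "amd64"}, {"windows", "arm64"},
	{"linux", "amd64"}, {"linux", "arm64"},
	{"darwin", "amd64"}, {"darwin", "arm64"},
	{"freebsd", "amd64"},
}

// buildCommand renders a cross-compile command line for one target.
// This is an assumed shape for the check, not the project's CI script.
func buildCommand(t target) string {
	cmd := fmt.Sprintf("GOOS=%s GOARCH=%s CGO_ENABLED=0 go build", t.goos, t.goarch)
	if t.goos == "freebsd" {
		// Flag quoted from the FreeBSD notes in this release.
		cmd += ` -gcflags="github.com/go-webgpu/goffi/internal/fakecgo=-std"`
	}
	return cmd + " ./..."
}

func main() {
	for _, t := range targets {
		fmt.Println(buildCommand(t))
	}
}
```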
Full Changelog
See CHANGELOG.md
v0.4.2
purego Compatibility Fix
Fixed
- Unix: duplicate symbol conflict with purego -- added build tag `nofakecgo` to resolve `_cgo_init` linker collision when goffi and purego coexist with `CGO_ENABLED=0` (#22)
Workaround
When using both goffi and purego in the same binary:
`CGO_ENABLED=0 go build -tags nofakecgo ./...`
This disables goffi's internal fakecgo, relying on purego's identical copy.
Also included
- Unit tests for `types` and `internal/arch/amd64` packages
- Test coverage increased from 75% to 89% (`-coverpkg=./...`)
- Dynamic Codecov badge in README
Closes #22
v0.4.1 — ABI Compliance Hotfix
Full forward call path audit — 10 of 11 identified ABI gaps fixed.
Fixed
- Float32 argument encoding bug -- arguments are now encoded with `math.Float32bits` instead of float64 widening, which corrupted XMM bit patterns
- AMD64 Unix: stack spill for arguments 7+ -- args beyond 6 GP registers now correctly pushed to stack before CALL
- ARM64 Unix: stack spill for arguments 9+ — args beyond 8 GP registers now correctly pushed to stack before BL
- AMD64 struct return 9-16 bytes — RAX+RDX register pair correctly assembled into output buffer
- AMD64 sret hidden pointer — structs >16B inject caller buffer as first arg (RDI), per System V ABI
- ARM64 HFA stack spill — HFA overflow correctly spills entire aggregate to stack per AAPCS64
- runtime.KeepAlive — added after each FFI call to prevent GC of argument pointers
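To make the float32 encoding fix concrete, here is a minimal sketch comparing the two encodings of the same value. The helper names `encodeFloat32Arg` and `encodeWidened` are hypothetical and are not goffi's internal functions; the point is only that widening to float64 yields a different bit pattern than the single-precision one a callee expects in the low 32 bits of an XMM register.

```go
package main

import (
	"fmt"
	"math"
)

// encodeFloat32Arg returns the 64-bit register image for a float32 argument:
// the IEEE-754 single-precision bits sit in the low 32 bits.
func encodeFloat32Arg(f float32) uint64 {
	return uint64(math.Float32bits(f))
}

// encodeWidened shows the buggy variant: the value is widened to float64
// first, so the register holds a double-precision bit pattern instead.
func encodeWidened(f float32) uint64 {
	return math.Float64bits(float64(f))
}

func main() {
	fmt.Printf("%#x\n", encodeFloat32Arg(1.5)) // 0x3fc00000
	fmt.Printf("%#x\n", encodeWidened(1.5))    // 0x3ff8000000000000
}
```

A callee that reads the low 32 bits of the widened pattern as a float32 would see 0.0 here rather than 1.5.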
Added
- Overflow detection -- `ErrTooManyArguments` for >15 args
- Regression tests: `TestWindowsStackArguments`, `TestWindowsStackArgumentsFileIO`, `TestWindowsStackArguments10Args`, `TestFloat32ArgEncoding`, `TestOverflowDetection`, `TestUnixStackSpill7Args`
Removed
- Dead `callUnix64` assembly experiment
Known Limitation (documented)
- Windows: float return from XMM0 -- `syscall.SyscallN` only returns RAX, not XMM0
Verification
- Build: all 5 platforms cross-compile OK
- Tests: all PASS, coverage 89.6%
- Linter: 0 issues
Closes #19
v0.4.0 — crosscall2 Integration
What's New
crosscall2 integration — callbacks now work from C-library-created threads (Metal, wgpu-native).
Added
- crosscall2 integration for C-thread callback support (#16)
  - Dispatchers route through `crosscall2 → runtime·load_g → runtime·cgocallback`
  - Supports callbacks from arbitrary C threads
  - `callbackWrap_call` closure for ABIInternal fn ptr from assembly
  - `go_asm.h` constants for `callbackArgs` struct offsets
Fixed
- fakecgo trampoline register bugs (synced with purego v0.10.0)
  - ARM64: R26→R9, R2→R9, threadentry callee-save/restore
  - AMD64: DX→R11, CX→R11, BX→R11, JMP tail calls, PUSH_REGS_HOST_TO_ABI0
Verification
- All CI checks pass (Linux, Windows, macOS)
- 89.6% test coverage
- 0 linter issues
- 5-platform cross-compile verified
Full Changelog: v0.3.9...v0.4.0
v0.3.9 — ARM64 Callback Fixes
What's Changed
Fixed
- ARM64 callback trampoline rewrite -- replaced `BL dispatcher` with `MOVD $index, R12` + `B dispatcher` pattern (matching Go runtime and purego conventions). Fixes LR corruption and entrySize mismatch for callbacks at index > 0.
- Symbol rename -- callback assembly symbols renamed to package-scoped (`·callbackTrampoline` / `·callbackDispatcher`) to avoid linker collision with purego (#15)
Known Limitations
- crosscall2 bypass -- callbacks invoked from C-library-created threads (e.g., Metal `addCompletedHandler:`) may fail because goffi calls Go directly without `crosscall2 → runtime·cgocallback`. Tracked in #16, planned for v0.4.0.
Upgrading
`go get github.com/go-webgpu/goffi@v0.3.9`
If you use goffi callbacks on ARM64 (macOS Apple Silicon / Linux ARM64), this update is strongly recommended.
Full Changelog: v0.3.8...v0.3.9
v0.3.8: Enterprise-grade CGO_ENABLED=1 Error Handling
What's Changed
This release fixes confusing linker errors that occurred when building on Linux/macOS with a C compiler (gcc/clang) installed.
Fixed
- CGO_ENABLED=1 build error handling (gogpu/wgpu#43)
  - Users now see a clear compile-time error: `undefined: GOFFI_REQUIRES_CGO_ENABLED_0`
  - Opening the source file shows full instructions in godoc comment
Added
- Compile-time CGO detection with descriptive error identifier
- Requirements section in README.md with clear `CGO_ENABLED=0` instructions
- Runtime panic fallback with detailed fix instructions (defense in depth)
Changed
- Added `!cgo` build constraint to:
  - `ffi/dl_unix.go`, `ffi/dl_darwin.go`
  - `internal/dl/dl_stubs_unix.s`, `internal/dl/dl_wrappers_unix.s`
  - `internal/dl/dl_stubs_arm64.s`, `internal/dl/dl_wrappers_arm64.s`
User Experience
Before (v0.3.7):
# github.com/go-webgpu/goffi/ffi
.../dl_unix.go:54:20: undefined: dl.Dlopen
Confusing - no indication of how to fix
After (v0.3.8):
# github.com/go-webgpu/goffi/ffi
ffi/cgo_unsupported.go:28:9: undefined: GOFFI_REQUIRES_CGO_ENABLED_0
Clear - identifier name tells user exactly what's needed
Quick Fix
`CGO_ENABLED=0 go build ./...`
Or set permanently:
`go env -w CGO_ENABLED=0`
Full Changelog: v0.3.7...v0.3.8
v0.3.7 - ARM64 Darwin Comprehensive Support
This release adds comprehensive ARM64 darwin (Apple Silicon) support, tested on M3 Pro.
Added
- ARM64 Darwin comprehensive support (PR #9 by @ppoage)
  - Tested on Apple Silicon M3 Pro (64 ns/op benchmark)
  - Nested struct handling via `placeStructRegisters()`
  - Mixed int/float struct support via `countStructRegUsage()`
  - `ensureStructLayout()` for auto-computing size/alignment
  - Assembly shim (`abi_capture_test.s`) for ABI verification
  - Comprehensive darwin ObjC tests (747 lines)
  - Struct argument tests (537 lines)
- r2 (X1) return for 9-16 byte struct returns
  - `Call8Float` now returns both X0 and X1
  - Fixes struct returns between 9-16 bytes on ARM64
- uint64 bit patterns for float registers
  - Cleaner handling of mixed float32/float64 arguments
Fixed
- BenchmarkGoffiStringOutput segfault on darwin
  - Pointer argument now correctly passed as `unsafe.Pointer(&strPtr)`
Contributors
- @ppoage - ARM64 Darwin fixes, ObjC tests, assembly shim
Full Changelog: v0.3.6...v0.3.7
v0.3.6 - ARM64 HFA/Large Struct Return Fix
Critical Fixes for ARM64 (Apple Silicon M1/M2/M3/M4)
Fixed
- ARM64 HFA (Homogeneous Floating-point Aggregate) returns
  - NSRect (4 × float64) returned zeros on Apple Silicon
  - Root cause: assembly only saved D0-D1, HFA needs D0-D3
  - Solution: save all 4 float registers for HFA returns
- ARM64 large struct return via X8 (sret)
  - Non-HFA structs >16 bytes returned via implicit pointer in X8
  - Root cause: X8 register never loaded before function call
  - Solution: load rvalue pointer into X8 for sret calls
Added
- `ReturnHFA2`, `ReturnHFA3`, `ReturnHFA4` return flag constants
- `handleHFAReturn` function for processing HFA struct returns
- Unit tests for ARM64 HFA classification
Technical Details
- AAPCS64: HFA structs with 1-4 same-type floats return in D0-D3
- AAPCS64: Large non-HFA structs (>16 bytes) return via hidden pointer in X8
- NSRect = CGRect = 4 × float64 = 32 bytes = HFA (returns in D0-D3)
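The return rules listed above can be sketched as a small classifier. This is a simplification under stated assumptions: struct members are given as an already flattened list of scalar kinds, and the names `fieldKind`, `retClass`, `isHFA`, and `classifyReturn` are hypothetical, not goffi's internal API.

```go
package main

import "fmt"

// fieldKind is a simplified scalar type tag for flattened struct members.
type fieldKind int

const (
	kindInt fieldKind = iota
	kindFloat32
	kindFloat64
)

// retClass names where a struct return value lives after the call.
type retClass string

const (
	retHFA  retClass = "HFA in FP registers"
	retGP   retClass = "X0/X1"
	retSret retClass = "memory via X8"
)

// isHFA reports whether the members form a homogeneous floating-point
// aggregate: 1 to 4 members, all of the same floating-point kind.
func isHFA(fields []fieldKind) bool {
	if len(fields) < 1 || len(fields) > 4 || fields[0] == kindInt {
		return false
	}
	for _, f := range fields[1:] {
		if f != fields[0] {
			return false
		}
	}
	return true
}

// classifyReturn applies the rules from the list above: HFAs use FP
// registers, other structs over 16 bytes use the X8 pointer, the rest X0/X1.
func classifyReturn(fields []fieldKind, size int) retClass {
	switch {
	case isHFA(fields):
		return retHFA
	case size > 16:
		return retSret
	default:
		return retGP
	}
}

func main() {
	nsRect := []fieldKind{kindFloat64, kindFloat64, kindFloat64, kindFloat64}
	fmt.Println(classifyReturn(nsRect, 32)) // HFA in FP registers

	mixed := []fieldKind{kindInt, kindFloat64, kindInt}
	fmt.Println(classifyReturn(mixed, 24)) // memory via X8
}
```

Note that the 32-byte NSRect is checked for HFA status before the size test, which is why it does not fall into the X8 path despite exceeding 16 bytes.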
Impact
- Fixes blank window issue on macOS ARM64 (GPU window size was 0×0)
- Fixes gogpu#24
Full Changelog: v0.3.5...v0.3.6
v0.3.5 - Windows Stack Arguments Fix
Fixed
- Windows stack arguments not implemented (Critical)
  - Functions with >4 arguments caused `panic: stack arguments not implemented`
  - Win64 ABI: first 4 args in registers (RCX/RDX/R8/R9), args 5+ on stack
  - Solution: Use `syscall.SyscallN` with variadic args for unlimited argument support
  - Affected Vulkan functions: `vkCreateGraphicsPipelines` (6 args), `vkCmdBindVertexBuffers` (5 args), etc.
Changed
- Simplified Windows FFI - removed intermediate syscall wrapper
  - Removed: `internal/syscall/syscall_windows_amd64.go`
  - `call_windows.go` now calls `syscall.SyscallN` directly with `args...`
  - Cleaner code, fewer indirections
Technical Details
- `syscall.SyscallN(fn, args...)` supports up to 15+ arguments
- Handles both register (1-4) and stack (5+) arguments automatically
- Same approach used by purego for Windows FFI
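The register/stack split that `syscall.SyscallN` takes care of can be illustrated with a portable sketch. It covers integer and pointer arguments only and ignores details such as Win64 shadow space and float registers; `splitArgs` and the constants are hypothetical names. The register counts are the ones quoted in these notes (4 for Win64 here, 6 and 8 from the v0.4.1 entries).

```go
package main

import "fmt"

// Integer argument register counts per ABI, as listed in these release notes.
const (
	win64Regs   = 4 // RCX, RDX, R8, R9
	sysvRegs    = 6 // System V AMD64 GP registers
	aapcs64Regs = 8 // AAPCS64 GP registers
)

// splitArgs divides integer/pointer arguments into those passed in
// registers and those spilled to the stack, for an ABI with numRegs
// argument registers.
func splitArgs(args []uintptr, numRegs int) (regs, stack []uintptr) {
	if len(args) <= numRegs {
		return args, nil
	}
	return args[:numRegs], args[numRegs:]
}

func main() {
	args := []uintptr{1, 2, 3, 4, 5, 6} // a 6-argument call

	regs, stack := splitArgs(args, win64Regs)
	fmt.Println(len(regs), len(stack)) // 4 2

	regs, stack = splitArgs(args, sysvRegs)
	fmt.Println(len(regs), len(stack)) // 6 0
}
```

Under this model a 6-argument call such as `vkCreateGraphicsPipelines` needs two stack slots on Win64, which is the case the old code path panicked on.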
Full Changelog
v0.3.4 - Windows Stack Overflow Fix
Fixed
- Windows stack overflow on Vulkan API calls (Critical)
  - `callWin64` assembly used `NOSPLIT, $32` - prevented Go runtime stack growth
  - Solution: Replace with `syscall.SyscallN` (Go runtime's asmstdcall mechanism)
  - Matches purego's proven approach for Windows FFI
Changed
- Windows FFI architecture refactored
  - Removed: `internal/arch/amd64/call_windows.s`
  - Added: `internal/syscall/syscall_windows_amd64.go`
  - Uses Go runtime's built-in stack management
Technical Details
The custom Windows assembly used the NOSPLIT directive, which prevents the Go runtime from growing the goroutine stack. When C functions (especially Vulkan/WebGPU APIs) required more stack space than the fixed 32-byte frame, the call failed with STACK_OVERFLOW (exception 0xc00000fd).
The fix uses `syscall.SyscallN`, which internally goes through `runtime.cgocall` and asmstdcall, so stack growth is managed by the Go runtime.