Sam Neill Furr
I use this site to document my projects. Feel free to get in touch for more details. Find me on LinkedIn or at samneillfurr at gmail dot com.
- Replacing Linear Attention Projections With Learned Rotations
Grant Sanderson (3Blue1Brown) posted a video about taking the logarithm of an image. Conformal maps in complex space. Squares stay squares. Geometry bends but doesn’t shear. I watched it and thought: what if attention worked like that?
Transformers project embeddings through big learned matrices W_Q and W_K. For d_head=64, that’s 4,096 parameters per projection per head. But conformal maps preserve angular relationships between vectors and are parameterized by a tiny number of values.
If attention heads are just learning geometric transforms, most of those 4,096 parameters are wasted.
That thought pulled me through Clifford algebras, Möbius transforms, quaternion poles, and N-dimensional conformal groups. Most of the theory was wrong. But something useful fell out.
I’m calling it Affine Möbius Attention.
The Möbius decomposition
A Möbius transform is the most general conformal map of N-dimensional space. It decomposes into:
- Translation. Shift the vector. (d_head params)
- Rotation. Rotate in paired orthogonal planes. (d_head/2 angle params)
- Inversion. Map z → z/|z|². (0 params, creates a pole)
- Scaling. Uniform scale. (1 param)
Total: ~1.5 × d_head + 1 parameters. For d_head=64 that’s 97 params versus 4,096 for linear. A 42x reduction.
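The parameter arithmetic is easy to sanity-check in a few lines of Python (the function names are mine):

```python
def affine_mobius_params(d_head: int) -> int:
    """One affine Möbius map: d_head/2 rotation angles,
    d_head translation components, and 1 log-scale."""
    return d_head // 2 + d_head + 1

def linear_params(d_head: int) -> int:
    """A standard per-head projection is a d_head x d_head matrix."""
    return d_head * d_head

print(affine_mobius_params(64))   # 97
print(linear_params(64))          # 4096
print(linear_params(64) / affine_mobius_params(64))  # ~42.2
```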
The inversion is the mathematically interesting part. Theory says the pole is harmless in high dimensions. The theory is wrong. But I’m getting ahead of myself.
GPT-2’s heads aren’t geometric
Before writing training code I pulled the QK circuit matrices from every head in GPT-2 Small. Polar decomposition. Measured conformality.
Mean conformality: 0.44 out of 1.0. Singular value ratios of 100-5000x. Zero heads above 0.8.
GPT-2’s heads amplify some dimensions thousands of times more than others. They use the full expressivity of their linear projections. Not conformal. Not close.
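To make the measurement concrete: a linear map is conformal exactly when it is a uniform scale of an orthogonal matrix, i.e. when all its singular values coincide. The post doesn't spell out the exact conformality metric used, so here is a minimal numpy sketch of one natural choice, the ratio of smallest to largest singular value:

```python
import numpy as np

def conformality(W):
    """A conformal linear map is (scalar x orthogonal), so all its
    singular values are equal. Score = sigma_min / sigma_max:
    1.0 for perfectly conformal, near 0 for heavily sheared maps.
    (One natural metric; not necessarily the post's exact one.)"""
    s = np.linalg.svd(W, compute_uv=False)
    return float(s.min() / s.max())

rng = np.random.default_rng(0)
rot, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # a random rotation
score_conformal = conformality(3.0 * rot)             # uniform scale x rotation
score_generic = conformality(rng.standard_normal((64, 64)))
print(score_conformal)  # 1.0
print(score_generic)    # far below 1: generic matrices stretch unevenly
```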
The foundational hypothesis was dead. I built it anyway.
Inversion destroys everything
Standard attention baseline: 2.41 BPB. Full Möbius with inversion: 3.39 BPB. Forty percent worse. Every variant with inversion failed.
Remove the inversion, keep rotation + translation + scale? Loss came right down to baseline.
The thing that makes Möbius transforms mathematically special is the thing that makes them useless for attention. “Measure zero” doesn’t mean “gradient-stable.” I dropped the inversion and never looked back.
Fewer parameters, better generalization
With dropout, Affine Möbius hit 2.250 val BPB versus 2.316 for standard attention. 95.4% fewer Q/K parameters. Better loss.
| Model | Q/K Params | Val BPB |
| --- | --- | --- |
| Standard + dropout | 67,584 | 2.316 |
| Affine Möbius + dropout | 3,136 | 2.250 |

Learned rotation planes? Worse than fixed adjacent pairs. Hybrid (linear early layers + Möbius later)? Worse than pure Möbius everywhere.
The constraint is structural regularization. The model can’t memorize token co-occurrences through arbitrary projections. It has to learn coherent geometric transforms. Those generalize better.
The actual transform:
```python
cos_a, sin_a = cos(angles), sin(angles)
x1, x2 = x[..., :half], x[..., half:]
x_rot = cat([x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a], dim=-1)
Q = exp(log_scale) * x_rot + translation
```

Per head: d_head/2 angles, d_head translations, 1 scale. For GQA with 8 query heads and 4 KV heads: 1,164 params/layer versus 393,216 for linear Q/K.
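A self-contained numpy version of that transform, useful for poking at the geometry (my sketch, not the training code):

```python
import numpy as np

def affine_mobius_q(x, angles, log_scale, translation):
    """Rotate in fixed dimension pairs (dim i paired with dim i + d/2),
    then apply a uniform scale and a translation. Numpy sketch of the
    snippet above, not the author's training code."""
    half = x.shape[-1] // 2
    cos_a, sin_a = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    x_rot = np.concatenate([x1 * cos_a - x2 * sin_a,
                            x1 * sin_a + x2 * cos_a], axis=-1)
    return np.exp(log_scale) * x_rot + translation

d = 64
rng = np.random.default_rng(1)
x = rng.standard_normal(d)
q = affine_mobius_q(x, rng.standard_normal(d // 2),
                    log_scale=0.0, translation=np.zeros(d))
# With zero log-scale and zero translation this is a pure rotation,
# so the vector's norm is preserved exactly:
print(np.allclose(np.linalg.norm(q), np.linalg.norm(x)))  # True
```

Each 2D pair gets its own learned angle, so angular relationships within pairs are preserved by construction — that's the conformal piece that survived.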
H100 scale
Integrated into the Parameter Golf SOTA stack. 512d, GQA, Flash Attention 3, GPTQ, 8xH100, 10-minute window. The parameter savings let me scale from 11 to 13 layers within the 16MB budget.
| Config | Steps | Sliding BPB | Artifact |
| --- | --- | --- | --- |
| SOTA (11L, linear Q/K) | ~6,900 | 1.1147 | 15.86MB |
| 13L Affine Möbius | 5,344 | 1.1463 | 15.95MB |

Gap: 0.032 BPB. SOTA wins.
But the Möbius ops are 4.4% of CUDA time. The gap comes from layer count. Two extra layers cost ~1,500 training steps over 10 minutes. Better per-step learning, fewer steps. H100 tensor cores are so fast at matmuls that O(d) vs O(d²) doesn’t matter. The matmuls aren’t the bottleneck.
Quantization surface area
Affine Möbius has far fewer parameters exposed to quantization. The rotation params themselves are tiny (97 per head) and stored as float passthrough. The bulk of quantization pressure falls on V, output, and MLP weights, which are identical between architectures.
In early 1xH100 experiments with undertrained models, the standard baseline showed much worse GPTQ degradation. But this turned out to be an artifact: both models’ Hessians were degenerate from insufficient training, and the baseline simply had more weight matrices hitting the quantization fallback path. On the full 8xH100 runs, both architectures quantize cleanly:
- Affine Möbius 13L: 1.1676 → 1.1719 post-quant (tiny degradation)
- SOTA: comparable post-quant behavior
Whether fewer Q/K parameters translate to a real quantization advantage at scale is an open question. The reduced quantization surface area is structurally real. But I don’t have the evidence to make a strong claim yet.
What this is
After all the Clifford algebra theory, what survived fits in a tweet:
Replace W_Q and W_K with learned rotations in fixed dimension pairs, plus translation and scale.
The affine subgroup of Möbius transforms. No inversion, no poles, no complex numbers. Just angles and offsets.
Closest existing work:
- RoPE: pairwise rotations, but position-dependent fixed angles after linear projections. Ours replaces the projections. Both compose naturally.
- MöbiusAttention (Halacheva et al. 2024): full Möbius in complex space per dimension pair (arXiv:2409.12175). We drop the inversion they kept and work in real space with cross-dimension coupling.
- qGOFT / BOFT: Givens rotations for fine-tuning. They modify existing projections. We replace them.
What if this scaled?
I don’t have the compute to test frontier scale. But the arithmetic is interesting.
A 128-layer model with 96 heads at d_head=128: standard Q/K projections use ~400M parameters. Affine Möbius: ~2.4M. That’s 397M freed parameters you could spend on more layers, wider MLPs, or larger embeddings.
During inference, Q/K projection weights get loaded from memory every token. Affine Möbius weights are ~170x smaller. Less memory traffic, lower latency, higher throughput.
In distributed training, cutting Q/K params by 99% makes attention layers nearly free in the allreduce communication graph. The training bottleneck shifts entirely to MLP and embedding layers.
The catch: my Phase 0 showed GPT-2 heads are highly non-conformal. If that holds at scale, some heads might need more expressivity. The fix could be Affine Möbius plus a small low-rank residual for heads that need it. Even Möbius + rank-4 residual would be dramatically smaller than full linear projections. There’s also an open question about per-head dimension slicing: each head sees a fixed 64-dim slice of the full model dimension. At wider models this constraint might bite.
Someone with a cluster should try this.
The path
3Blue1Brown video → complex holomorphic attention → Clifford algebras → Möbius transforms → Phase 0 reality check (GPT-2 isn’t conformal) → built it anyway → inversion fails → affine subgroup works → beats standard on small data → competitive at H100 scale.
Most of the math was unnecessary. But every wrong turn narrowed the search space. Sometimes the right answer is the simplest piece of a complicated idea.
- FoveatedKV
https://github.com/samfurr/foveated_kv
I’ve been running local LLMs on my Mac a lot lately, and I kept hitting the same wall. The bottleneck is memory bandwidth. Every decode step has to read the entire KV cache from memory. At long context on an 8GB machine, that’s gigabytes moving per token. The model is mostly just sitting there waiting for data.
The usual fixes are to either evict tokens (fast, but you’re literally deleting information and breaking the softmax denominator) or quantize everything uniformly (which treats the answer to your question with the same precision as filler text). Both bugged me. Attention focuses. On any given step, a handful of tokens get nearly all the weight. So why store the unimportant ones at the same precision?
The VR headset thing
This got me thinking about foveated rendering. VR headsets figured this out a long time ago. Your eye can only resolve detail at the center of your gaze, maybe 2° of your visual field. Everything else is peripheral blur. So they render the center sharp and blur the rest. You save 50-70% of GPU work and nobody can tell the difference.
I applied the same idea to LLM attention:
| VR Foveated Rendering | FoveatedKV |
| --- | --- |
| Eye position | Query vector |
| Sharp center of gaze | Near tier: top 10% of tokens at fp16 |
| Blurry periphery | Far tier: bottom 90% at fp8 keys + INT4 values |
| Full-res frame buffer on disk | NVMe mmap archive of the fp16 originals |
| Eye tracker detects gaze shift | Metal kernel spike detection (falls out of softmax for free) |
| Re-render at new gaze point | C++ worker promotes exact fp16 into near tier |

The important bit: every token still contributes to attention. The softmax denominator stays correct. Nothing gets evicted. The only approximation is quantization noise on tokens the model isn’t paying attention to, and that noise is bounded, non-accumulating, and recoverable.
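How much that quantization noise costs depends mostly on key precision, because key error perturbs attention scores before the exponential. A toy numpy illustration, using a crude uniform integer quantizer of mine (not the repo's fp8/INT4 codecs), comparing 4-bit and 8-bit keys:

```python
import numpy as np

def fake_quant(x, bits):
    """Crude symmetric uniform quantizer, illustration only --
    not the repo's fp8 E4M3 / INT4 codecs."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
D, T = 64, 256
q = rng.standard_normal(D)
K = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))

def attend(K_, V_):
    """Plain single-query softmax attention."""
    s = K_ @ q / np.sqrt(D)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V_

exact = attend(K, V)
err_k8 = np.linalg.norm(attend(fake_quant(K, 8), V) - exact)
err_k4 = np.linalg.norm(attend(fake_quant(K, 4), V) - exact)
print(err_k4 / err_k8)  # 4-bit keys are far noisier than 8-bit keys
```

Each bit removed from the keys roughly doubles the score perturbation feeding the exponential, which is why the key/value precision split is asymmetric.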
For the far tier I use fp8 E4M3 for keys and INT4 for values. Multiple recent papers (KIVI, LeanKV, KV-AdaQuant, AsymKV) have independently found that keys need higher precision than values because key error gets amplified through `exp()` in softmax. My ablation confirms it: INT4 keys produce 34x more error than fp8.

Making it fast
The quantization strategy is established prior work. Where I spent most of my time was making it actually fast on Apple Silicon. I wrote a fused split-K Metal kernel that handles both precision tiers plus the decode buffer in a single GPU dispatch. The query gets pre-scaled by 1/√D at load time. fp8 decoding uses a 256-entry LUT in threadgroup memory, so one shared-memory read instead of 10+ ALU ops per element. Score-gated value loading skips INT4 dequant entirely when a token’s score falls below exp(−16) (basically zero contribution). Online softmax computes each exp() exactly once.
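In Python terms, the online-softmax-with-gating logic looks roughly like this. It's a sketch of the control flow only, not the Metal kernel; the `gate=-16.0` default mirrors the exp(−16) cutoff above:

```python
import math

def gated_online_attention(scores, values, gate=-16.0):
    """One-pass (online) softmax over attention scores that skips the
    value load entirely when a token's score sits more than |gate|
    below the running max. Every token still enters the denominator,
    so the softmax normalization stays exact."""
    m = -math.inf              # running max score
    denom = 0.0                # running softmax denominator
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        if s > m:              # new max: rescale previous partial sums
            c = math.exp(m - s) if denom > 0.0 else 0.0
            denom *= c
            acc = [a * c for a in acc]
            m = s
        w = math.exp(s - m)    # each exp() computed exactly once
        denom += w             # every token counts in the denominator
        if s - m > gate:       # score-gated value load
            acc = [a + w * vi for a, vi in zip(acc, v)]
        # else: weight < exp(-16) ~ 1e-7, so the value (and its INT4
        # dequant) is skipped entirely
    return [a / denom for a in acc]
```

The gate only ever drops contributions below exp(−16) of the current max, so the output matches exact attention to within float noise.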
Spike detection falls right out of the softmax bookkeeping with no extra work. When the kernel notices a far-tier token scoring higher than near-tier tokens, a C++ background worker reads the exact fp16 original from an NVMe-backed mmap archive and swaps it into the near tier. Next kernel dispatch, it’s just an ordinary full-precision token. Raw spike rate is high (~95% per head-layer slot), but aggressive filtering (cooldown, dedup, budget cap, GQA dedup) brings the effective promotion rate down to about 3.1%.
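The filtering stage can be sketched as a small stateful object. The policy below (a per-token cooldown, which also dedups repeats, plus a per-step budget cap) is a hypothetical simplification of the real worker, which additionally dedups across GQA groups:

```python
class PromotionFilter:
    """Thins raw spike events before promotion. Hypothetical shape of
    the filtering stage: cooldown + per-step budget; the cooldown map
    doubles as dedup. The real worker also dedups across GQA groups."""
    def __init__(self, cooldown=32, budget=4):
        self.cooldown = cooldown          # steps before re-promotion
        self.budget = budget              # max promotions per step
        self.last_promoted = {}           # token index -> last step

    def filter(self, spikes, step):
        promoted = []
        for tok in spikes:
            if len(promoted) >= self.budget:
                break                     # budget cap hit this step
            last = self.last_promoted.get(tok)
            if last is not None and step - last < self.cooldown:
                continue                  # still cooling down
            self.last_promoted[tok] = step
            promoted.append(tok)
        return promoted
```

Stacking cheap filters like this is how a ~95% raw spike rate collapses to a ~3% effective promotion rate without starving genuinely hot tokens.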
In a multi-fact needle-in-haystack retrieval test, promotion recovered 24 additional facts out of 64, bringing accuracy from 50% all the way up to 88%. That matches the full fp16 baseline exactly. Pretty satisfying.
Results
Everything benchmarked on my M2 with 8GB. Three model families: Qwen2.5-0.5B, Qwen2.5-7B, and Llama-3.2-1B (all 4-bit weight quantized).
- 2.3x Faster 7B decode at 512 context on 8GB. Compressed cache reduces memory pressure
- 0.995+ Cosine fidelity vs exact fp16 attention, stable from 512 to 32K
- 2.0x Memory compression at 10% near tier, consistent across all context lengths
- 88% Multi-fact retrieval with promotion, matching full fp16 baseline
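The 2.0x compression figure is consistent with raw byte arithmetic. My estimate below ignores per-block quantization scales and other metadata, which is plausibly where the gap between the ~2.3x raw ratio and the measured 2.0x goes:

```python
def tiered_bytes_per_elem(near_frac=0.10):
    """Average bytes per cached element for keys and values,
    mixing an fp16 near tier with fp8 keys / INT4 values in the
    far tier. Ignores quantization-scale metadata (my simplification)."""
    fp16, fp8, int4 = 2.0, 1.0, 0.5
    key = near_frac * fp16 + (1 - near_frac) * fp8
    val = near_frac * fp16 + (1 - near_frac) * int4
    return key, val

key_b, val_b = tiered_bytes_per_elem()
raw_ratio = (2.0 + 2.0) / (key_b + val_b)   # fp16 K+V = 4 bytes per elem pair
print(round(raw_ratio, 2))  # 2.29 before metadata overhead
```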
The really fun result was on the 7B model. On 8GB hardware, FoveatedKV is actually faster than standard because the compressed cache reduces memory pressure enough to make the whole pipeline flow better. Standard OOMs at around 3K context. Foveated keeps going.
| Context | fp16 SDPA | Fused kernel | Speedup |
| --- | --- | --- | --- |
| 1K | 1.12 ms | 0.94 ms | 1.19x |
| 4K | 2.33 ms | 1.46 ms | 1.60x |
| 8K | 4.05 ms | 2.22 ms | 1.82x |
| 16K | 7.68 ms | 3.73 ms | 2.06x |
| 32K | 15.72 ms | 6.81 ms | 2.31x |

Fused kernel vs Apple SDPA, 7B GQA shapes (H_kv=8, H_q=32, D=128). Break-even around 512 tokens.

What’s next
The benchmarks so far are all on 8GB Apple Silicon, which is the most constrained and most interesting target. The obvious next steps are a llama.cpp head-to-head, full-length LongBench runs on 16GB+ hardware, and scaling validation on bigger models. The promotion pipeline also opens up some interesting possibilities around speculative caching: predicting which tokens will spike before they actually do.
The code, benchmarks, and a draft technical paper are all on GitHub. 73 tests, 8 benchmark suites, MIT licensed. Built with significant Claude assistance on the iteration and benchmarking side.
- Stool Project

My sister is an artist. Her focus is mainly oil painting and portraits, but she recently started using a hot knife to cut construction foam into different shapes to paint on. People seemed to really like her “shaped” artwork, but manually cutting out the shapes each time is quite tedious. That got me thinking about whether there was some easy way I could help her produce these shapes. I landed on the idea of a CNC machine, but one much cheaper than most traditional types. I found a pretty novel design the open source community had been iterating on, called the ‘Maslow’ CNC. It suspends the cutting part (in our case a Makita router) on two chains over an angled work surface, so the position of the cutter can be determined from the chain lengths and the work surface size. It’s actually a really elegant solution. I went ahead and ordered some parts and built one in the basement of my parents’ house, which doubles as my sister’s art studio.




With the machine constructed and working great for cutting foam, I started having ideas about what else it could be used for. I got to work in Shapr3D designing a simple flat-pack stool that I could use to test tolerances and plywood cutting on the machine. After a few iterations, I now have side tables/stools for my living room, as well as for my sister’s art project.




I had some friends come to stay with us for a week, and they loved the stool design, so I promised I would mail them one. A few months later, our friends got married and I still hadn’t sent them a stool. I was working on an updated design, partly to push the limits of my woodworking ability and partly to test out PCBWay’s CNC aluminum service. With one-inch-thick leg pieces, a compound-curve seat, and inlaid aluminum pieces, this was definitely a project that took longer than expected. I started by building a rough prototype using construction foam and 3D-printed pieces as stand-ins for the aluminum and wood.





After a few months, though, and after tracking down some nice marine-grade mahogany plywood, it was finally assembled and looking great. I built a custom cardboard box with fitted foam cutouts to keep it stable in shipping.





The finishing touch was my friends’ initials on the middle piece of aluminum. When my friends first got it, they had no idea I had made it until I showed them some work-in-progress pictures! They were pretty impressed, and their cat loves the custom box and foam.





Next up, an easier-to-build design that might also include more inlaid aluminum…
- Sculpture Project (Scalable multiplayer platform)

A view into the test scene, where each color represents a connection from instances running in different GCP regions.

As a platform I might want to use myself for any sort of massively multiplayer game, I designed an architecture that is cheap to run and can scale out seamlessly to handle as many players as you can throw at it. With client-facing websocket nodes written in Rust that scale seamlessly on GCP, and a backend built from sharded Redis cluster instances, there are no performance troubles with many connections from around the world. With client-side origin shifting and octrees to manage positions at larger scales, the potential play area can be as large as you could ever want (universe-scale). With further optimizations around tracking velocity and higher time derivatives, this has been a fun project to tinker with in the off-hours. The entire project runs in Docker containers, from the dev environment to production.
- Home k8s/ceph Cluster

Cluster, sans case. 
Cluster with its laser-cut delrin and walnut case, ready for the bookshelf.

To host my ever-growing collection of pet projects, home automation, media management, and site hosting, I designed and built a “highly available” (highly available enough for me) Kubernetes/Ceph cluster using 7 low-power arm64 nodes, each with its own redundant storage and redundant battery backup. The case for the system is made from laser-cut walnut, PVC, and standard threaded rod from McMaster-Carr. It automatically connects to the city Wi-Fi and tunnels to a cloud server if it ever sees the home connection go down. Though completely overbuilt, it gives me the perfect home test bench for networking and infrastructure.
- Homebrew CPU
EagleCAD, 74XX logic design, electronics

I started dreaming of building my own processor from low-level logic circuits back in grade school, but it wasn’t until recently that I gained the skills to design and manufacture the printed circuit boards for it. So far the project consists of four 8-bit registers, a basic ALU, and a bus board. I plan to design and build the instruction register and related circuitry next.
- How do you feel? (social platform)
Ruby on Rails, PostgreSQL, jQuery, Sass (CSS), JavaScript, HTML

This was a really fun project to work on. The number of iterations the design went through let me experiment with art styles and my own artistic taste before landing on this one. Working with Ruby on Rails and PostgreSQL also let me stretch my back-end design muscles and vastly improve how the site functions.
- Transistor Logic Gates
EagleCAD, circuit design

An earlier project of mine, but one where I probably learned the most. These small circuit boards are individual logic gates, built from single 2N2222 transistors and 10 kΩ resistors. Learning the ins and outs of transistor datasheets and circuit design was fun and satisfying.
- Go, Grandpa!
C#, Unity3D, Android, iOS

This is a mobile-first 2D side-scrolling game based on some of the Flash games my friends and I used to play long ago, when we first became fascinated with computers. With playful mechanics and a simple art style, Go, Grandpa! was fun to develop and is still fun to play.
- Speeks (socket-based web chat)
Node.js, Express, Socket.IO, jQuery, Sass (CSS), JavaScript, HTML

Working with Node.js and Socket.IO on this project was extremely fun and satisfying. I had always liked JavaScript, but getting to use it for back-end work as well was great. Since Speeks.us uses sockets to chat between people, no actual chat data gets stored on the main server.
- Mobile Design Study
Sketch, Adobe CC, Balsamiq

During my schooling, I took a set of mobile design courses that led me through the process of designing a modern, responsive mobile app from start to finish in a thorough and professional way. The iterations and process for this were based closely on the methods of professional mobile design shops.
- Last.fm top 10 (data aggregator)
React, Chart.js, Sass (CSS), JavaScript, HTML

After learning React as well as Redux, I spent a weekend creating this single-page React app that plots how the artists in your Last.fm weekly top 10 shift around over time. Because it was built with React, it came together incredibly quickly and cleanly. I rather enjoy working with React.