From Encoder to Decoder: Extending FLARE to Memory-Efficient Causal Attention
Running notes — last updated 2026-02-22. This is a living document, not a polished article; I update it frequently as my understanding develops.

Motivation

FLARE was originally developed as an encoder-style global mixing primitive: learned latent queries gather information from many tokens, then scatter it back. The decoder setting is harder because causality changes the algorithmic dependencies, the numerical-stability constraints, and what efficiency means in training versus inference. This post summarizes a practical path to causal FLARE for long-context language modeling. See also the dissertation proposal talk for broader context. ...
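To make the gather/scatter idea concrete, here is a minimal NumPy sketch of an encoder-style latent mixing step. This is my own illustrative toy, not FLARE's actual implementation: the function names (`latent_mix`, `softmax`) and the single-head, unprojected formulation are assumptions for clarity. A small set of learned latents first attends over all n tokens (gather), then each token attends over the updated latents (scatter), so global mixing costs O(n·m·d) instead of the O(n²·d) of full self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_mix(tokens, latents):
    """Gather-then-scatter global mixing (toy sketch, no projections/heads).

    tokens:  (n, d) token states
    latents: (m, d) learned latent queries, with m << n
    returns: (n, d) globally mixed token states
    """
    # Gather: each latent reads from every token.
    gathered = softmax(latents @ tokens.T) @ tokens   # (m, d)
    # Scatter: each token reads back from the updated latents.
    return softmax(tokens @ gathered.T) @ gathered    # (n, d)

rng = np.random.default_rng(0)
n, m, d = 16, 4, 8
out = latent_mix(rng.normal(size=(n, d)), rng.normal(size=(m, d)))
print(out.shape)
```

Note that in this encoder-style form every token influences every other token through the latents, which is exactly what the causal decoder setting forbids and what the rest of these notes work through.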