Stories by Aquin Labs on Medium

Training

Aquin Labs — Sat, 23 May 2026 09:29:50 GMT

Real-time signal detection, behavioral before/after comparison, and per-layer feature diffs: everything the loss curve does not show you.

What the loss curve does not show

A fine-tuning run has structure beneath the loss curve: gradient dynamics that reveal how information flows through the network, attention heads that can collapse silently, feature activations that shift as the model rewrites internal representations to accommodate the training objective. Almost none of that structure is visible from loss alone.

The training inspect system surfaces it in real time. A step event stream feeds loss, learning rate, gradient norms, per-layer breakdown, dead layer list, and epoch index into a signal engine that runs on each step as it arrives. When the engine detects a gradient spike, a loss plateau, a collapsed attention head, or the onset of loss divergence, a signal fires immediately with the specific metric and the exact step.

When training completes, the dashboard computes a model diff and a per-layer SAE feature diff, showing not just how loss moved, but which behaviors changed and which internal representations were rewritten.

What gets streamed

The system is agnostic to training framework. It consumes a flat step event schema: step index, loss, learning rate, max gradient norm, per-layer grad norms as a record, dead layer list, epoch index. No nested objects, no optional deep structures.

The per-layer grad norm breakdown is what enables the dead layer and attention collapse detectors. Without it, the engine observes only aggregate gradient behavior. With it, it names the specific layer that has collapsed and tracks how long it has been dead.
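As a rough illustration, a flat step event of this kind could be modeled as a small dataclass. The field names, types, and layer keys below are assumptions made for the sketch; the article lists which fields exist but not how they are named.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class StepEvent:
    """One training step as the signal engine would receive it (hypothetical names)."""
    step: int
    loss: float
    learning_rate: float
    max_grad_norm: float
    # Flat record: layer name -> gradient norm for that layer.
    layer_grad_norms: Dict[str, float] = field(default_factory=dict)
    dead_layers: List[str] = field(default_factory=list)
    epoch: int = 0


# Illustrative values only, not taken from a real run.
event = StepEvent(
    step=61,
    loss=1.42,
    learning_rate=2e-5,
    max_grad_norm=0.93,
    layer_grad_norms={"L5.mlp": 0.41, "L6.mlp": 1e-9},
    dead_layers=["L6.mlp"],
    epoch=0,
)
```

Any training loop that can emit these fields once per step could feed the engine, which is what makes the schema framework-agnostic.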

The signal engine

Five detectors

The signal engine is a pure function that runs on each new step snapshot. It takes the full step history plus two persistent streak maps, one for non-attention layers and one for attention layers, and returns a signal if one fired, or null. The streak maps are the only stateful part: they persist across steps so that dead layer detection can track how many consecutive steps a layer has had near-zero gradient.
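A minimal sketch of how a streak map could drive dead layer detection. The threshold `DEAD_EPS`, the streak length `DEAD_STREAK`, and the shape of the returned signal are all assumptions; the article does not state the actual values.

```python
from typing import Dict, Optional

DEAD_EPS = 1e-7    # assumed cutoff for a "near-zero" gradient norm
DEAD_STREAK = 5    # assumed number of consecutive dead steps before a signal fires


def detect_dead_layer(
    layer_grad_norms: Dict[str, float],
    streaks: Dict[str, int],
    step: int,
) -> Optional[dict]:
    """Update the streak map in place and return a signal dict, or None."""
    signal = None
    for layer, norm in layer_grad_norms.items():
        # A healthy step resets the streak; a near-zero step extends it.
        streaks[layer] = streaks.get(layer, 0) + 1 if norm < DEAD_EPS else 0
        if signal is None and streaks[layer] >= DEAD_STREAK:
            signal = {"type": "dead_layer", "layer": layer,
                      "step": step, "dead_for": streaks[layer]}
    return signal


streaks: Dict[str, int] = {}
signals = []
for step in range(1, 8):
    norms = {"L5": 0.4, "L6": 0.0}   # L6 never receives gradient in this toy run
    signals.append(detect_dead_layer(norms, streaks, step))
```

Here the first four steps return `None`, and from the fifth step on the detector names `L6` and reports how long it has been dead. The streak map is the only thing carried between calls.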

Priority and cooldown

Loss divergence and gradient spikes are checked first because they indicate active instability that may warrant stopping the run. Dead layer and attention collapse are checked next, naming the specific failed components. Loss plateau is last: it frequently describes healthy convergence rather than a problem, and its priority reflects that.

A 30-step cooldown prevents the same signal type from re-emitting continuously. A gradient spike that resolves and recurs can fire again once the 30 steps have elapsed; the second occurrence is a distinct event with its own context.
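The priority order and cooldown can be sketched as a small selection function. The signal type names and the choice to fall through to a lower-priority signal while a higher one is cooling down are assumptions; only the ordering and the 30-step window come from the description above.

```python
from typing import Dict, List, Optional

# Checked in this order: instability first, failed components next, plateau last.
PRIORITY = ["loss_divergence", "gradient_spike", "dead_layer",
            "attention_collapse", "loss_plateau"]
COOLDOWN_STEPS = 30


def select_signal(
    fired: List[str],
    step: int,
    last_emitted: Dict[str, int],
) -> Optional[str]:
    """Return the highest-priority fired signal that is not in cooldown."""
    for signal_type in PRIORITY:
        if signal_type not in fired:
            continue
        last = last_emitted.get(signal_type)
        if last is not None and step - last < COOLDOWN_STEPS:
            continue  # same type emitted too recently
        last_emitted[signal_type] = step
        return signal_type
    return None


last_emitted: Dict[str, int] = {}
first = select_signal(["loss_plateau", "gradient_spike"], 100, last_emitted)
during = select_signal(["gradient_spike"], 110, last_emitted)
after = select_signal(["gradient_spike"], 130, last_emitted)
```

In this run the spike wins over the plateau at step 100, is silenced at step 110, and is emitted again at step 130 as a separate event.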

The model diff

Behavioral delta

Three behavioral scores describe how the fine-tune changed the model from the outside: consistency score, suppression score, and robustness score. These are the same metrics from the eval system, applied to the base-vs-fine-tuned comparison. The base model is the reference; the fine-tuned checkpoint is the subject. The difference is the behavioral delta the training objective produced.

The robustness score is the most informative signal for factual fine-tuning. A fine-tune intended to reinforce factual knowledge should produce higher robustness on those facts. A robustness drop on target facts after factual fine-tuning means the model learned a surface pattern rather than a grounded representation.

The SAE feature diff

Layer change density

Behavioral scores describe the model from the outside. The SAE feature diff describes what changed internally. For each layer, the diff reports how many features shifted activation between base and fine-tuned, the mean absolute activation delta, and the single feature with the highest delta.

Layer-level change density is the most informative aggregate. A fine-tune that changes 14 of 512 features at L8 and 2 of 512 at L4 is making a focused, deep rewrite. The top feature per layer is where mechanistic investigation should start. If L10’s top shifted feature is F501 (refusal / safety language) and the training data had no refusal content, that warrants investigation in the model inspector.
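A minimal sketch of the per-layer summary, assuming each layer is represented by one mean activation per SAE feature for the base and fine-tuned models. The `SHIFT_THRESHOLD` cutoff and the choice to average the delta over all features (rather than only shifted ones) are assumptions.

```python
from typing import Dict, List

SHIFT_THRESHOLD = 0.1  # assumed minimum |delta| for a feature to count as shifted


def layer_feature_diff(base: List[float], tuned: List[float]) -> Dict[str, float]:
    """Summarize how one layer's feature activations moved between two models."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    shifted = sum(1 for d in deltas if d > SHIFT_THRESHOLD)
    top_feature = max(range(len(deltas)), key=deltas.__getitem__)
    return {
        "shifted": shifted,
        "density": shifted / len(deltas),           # e.g. 14 of 512 -> ~0.027
        "mean_abs_delta": sum(deltas) / len(deltas),
        "top_feature": top_feature,                 # index of the largest delta
    }


# Toy layer with four features; only feature 2 moves meaningfully.
base = [0.0, 0.5, 0.2, 0.9]
tuned = [0.0, 0.55, 0.8, 0.9]
summary = layer_feature_diff(base, tuned)
```

Running this per layer and sorting by `density` gives the aggregate view described above, with `top_feature` as the starting point for inspection.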

The regression tracker

A single model diff shows how one fine-tune changed behavior relative to base. The regression tracker extends this across runs: every time a model diff arrives, category scores are appended to a per-category history so behavior can be tracked across all completed runs in the session.

A category that regresses more than five percentage points on the latest run is flagged. Detection is relative to the immediately prior run, not to the base. A score can look healthy against the base model while trending negatively across iterations. The tracker catches that drift where the raw diff cannot.
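The run-over-run comparison can be sketched in a few lines. The class and method names are hypothetical, and scores are assumed to be on a 0 to 100 scale so that a difference of 5.0 is five percentage points.

```python
from collections import defaultdict
from typing import Dict, List

REGRESSION_POINTS = 5.0  # flag drops of more than five percentage points


class RegressionTracker:
    """Keeps a per-category score history across completed runs."""

    def __init__(self) -> None:
        self.history: Dict[str, List[float]] = defaultdict(list)

    def record(self, scores: Dict[str, float]) -> List[str]:
        """Append one run's category scores and return the flagged categories."""
        flagged = []
        for category, score in scores.items():
            runs = self.history[category]
            # Compare against the immediately prior run, not against the base.
            if runs and runs[-1] - score > REGRESSION_POINTS:
                flagged.append(category)
            runs.append(score)
        return flagged


tracker = RegressionTracker()
run1 = tracker.record({"medical": 82.0, "legal": 75.0})
run2 = tracker.record({"medical": 78.0, "legal": 76.0})  # 4-point drop: not flagged
run3 = tracker.record({"medical": 71.0, "legal": 77.0})  # 7-point drop: flagged
```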

Confidence calibration

A model’s stated confidence and its actual accuracy can diverge in ways invisible from loss alone. A fine-tune can lower loss while making the model systematically overconfident. Expected calibration error (ECE) measures that gap directly: it bins outputs by stated confidence, computes accuracy within each bin, and reports the mean gap between confidence and accuracy across bins.
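For reference, here is a small sketch of the standard ECE computation with equal-width bins, where each bin's gap is weighted by the share of samples it holds. The bin count and the weighting are the textbook defaults and are assumed here; the calibration panel may configure them differently.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Bin by confidence, then average |accuracy - confidence| weighted by bin size."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data: four answers at 95% confidence (half correct), two at 55% (half correct).
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [True, True, False, False, True, False]
ece = expected_calibration_error(confs, hits)
```

The high-confidence bin dominates the result here because it is both the most populated and the most overconfident, which is the pattern a reliability diagram makes visible.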

The calibration panel runs this comparison between base and fine-tuned using the training dataset as the evaluation set. The reliability diagram shows both models’ accuracy-per-confidence-bin as bar pairs against a perfect-calibration diagonal. The per-topic ECE table breaks the aggregate down by category. Models trained on domain-specific data frequently improve ECE on the target domain while degrading it on adjacent topics that share surface patterns with the training examples.

The low-confidence row list surfaces inputs where the fine-tuned model assigns probability below a configured threshold. These rows are exportable directly as a labeled dataset for the next training iteration. The model’s own uncertainty becomes the selection criterion for the data that trains the next version.

Training as the start of the investigation

Each finding from the training run is an entry point into a deeper investigation, not a terminal result. A dead layer signal at L6 step 61 is most usefully followed up by opening the fine-tuned checkpoint in the Model Inspector, going directly to L6, and running the causal trace to confirm whether that layer still contributes to outputs.

A suppression score that rises from base to fine-tuned opens a data investigation: the training dataset can be opened in the Data Inspector and the toxicity and PII modules run against the columns most likely to produce hedging signal.

The SAE feature diff provides the entry point for mechanistic investigation. Once the features that shifted most are identified and at which layers, the Model Inspector can be navigated directly to those features, their benchmark scores checked, the logit lens run, and steering applied to confirm their role. The diff turns post-training inspection from an open-ended search into a targeted inquiry.

The calibration panel adds a third path out of the training run: low-confidence rows are exported as a labeled dataset for the next iteration. The regression tracker closes the loop in the other direction, confirming the next iteration did not trade one weakness for another. Together they make the training session the input to the next investigation rather than the end of one.

The Attribution System

Aquin Labs — Sat, 23 May 2026 08:41:24 GMT

Seven tools that answer two questions: how did the model produce this output, and is the output actually correct?

Tracing facts through a language model

When a language model answers “What is the capital of France?” with “Paris”, it is not looking anything up. Somewhere in 1.2 billion parameters, the association was encoded during training and is retrieved at inference time through a sequence of matrix multiplications. Two questions follow: where exactly does the retrieval happen, and once we know the mechanism, is the answer actually right?

The attribution system runs two pipelines in sequence on every output. The first traces the mechanism: which layers, features, and prompt tokens caused each response token. The second evaluates the result: whether claims are true, whether the framing leans in a direction, whether certain topics were quietly avoided. Neither is complete without the other.

The query

A single factual query is run end-to-end through the full pipeline. The prompt is intentionally simple: an unambiguous causal structure makes each step’s output easier to read.

ROME-style causal mediation analysis is the entry point: each prompt token’s embedding is corrupted with scaled Gaussian noise, the forward pass is re-run, and the drop in the target token’s probability is measured. Averaging over multiple noise runs produces a causal score for every (prompt token, response token) pair.
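The corrupt-and-measure loop can be sketched without a real model. Below, `target_prob` is a hypothetical stand-in for "run the forward pass and read the target token's probability", and the noise scale and run count are assumed values. The sketch scores one response token; the full system repeats this for every (prompt token, response token) pair.

```python
import math
import random
from typing import Callable, List

Embedding = List[float]


def causal_scores(
    embeddings: List[Embedding],
    target_prob: Callable[[List[Embedding]], float],
    noise_scale: float = 3.0,
    runs: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean drop in target probability when each prompt token is noised in turn."""
    rng = random.Random(seed)
    clean = target_prob(embeddings)
    scores = []
    for position in range(len(embeddings)):
        drops = []
        for _ in range(runs):
            corrupted = [list(e) for e in embeddings]
            corrupted[position] = [
                x + rng.gauss(0.0, noise_scale) for x in corrupted[position]
            ]
            drops.append(clean - target_prob(corrupted))
        scores.append(sum(drops) / runs)
    return scores


def toy_prob(embs: List[Embedding]) -> float:
    # Toy "model": the target probability depends only on token 2's embedding
    # and peaks at 1.0 when that embedding equals [1.0, 0.0].
    dx, dy = embs[2][0] - 1.0, embs[2][1]
    return math.exp(-(dx * dx + dy * dy))


prompt_embeddings = [[0.3, 0.1], [0.5, -0.2], [1.0, 0.0]]
scores = causal_scores(prompt_embeddings, toy_prob)
```

In the toy setup only the third token carries causal signal, so it is the only position with a non-zero score.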

Attribution

Token attribution scores

Three prompt tokens dominate: “capital”, “of”, and “France”. Together they carry nearly all the causal signal driving “Paris”. “What” contributes almost nothing. The model identifies the semantically load-bearing tokens and routes most of the causal work through them, not through the full sentence structure.

16 layers, one peak

Causal patching localizes the retrieval to a specific layer. For each layer in turn, the clean residual stream is restored while all other layers remain corrupted, and the recovery in the target token’s probability is measured. The result is a causal drop score per layer: which one, when restored alone, brings “Paris” back.

Layer 8 accounts for 87% of the causal signal. The France to capital to Paris association is stored in the MLP sublayers at the network’s midpoint. This is the key-value store pattern: the subject representation (“France”) functions as a lookup key, and the MLP writes the associated value (“Paris”) into the residual stream at that layer.

The logit lens: watching confidence build

The causal trace locates the retrieval site. The logit lens shows what the model is predicting at each layer as it gets there. After every transformer block, the final layer norm is applied and the residual stream is projected directly into vocabulary space as if the model had stopped at that layer and been forced to output a token.
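A toy version of the logit lens, using plain lists in place of tensors. The unembedding rows, vocabulary, and residual vectors are invented for illustration, and the layer norm here omits the learned scale and bias a real model would apply.

```python
import math
from typing import List, Tuple


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(
    residuals: List[List[float]], unembed: List[List[float]], vocab: List[str]
) -> List[Tuple[str, float]]:
    """For each layer's residual vector, return the top token and its probability."""
    out = []
    for resid in residuals:
        normed = layer_norm(resid)
        # One unembedding row per vocabulary token.
        logits = [sum(n * w for n, w in zip(normed, row)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        out.append((vocab[best], probs[best]))
    return out


vocab = ["the", "Paris"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
residuals = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # an "early" and a "late" layer
lens = logit_lens(residuals, unembed, vocab)
```

The early residual decodes to a generic token and the later one to the answer, which is the same shape of transition the lens shows across real layers.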

Early layers produce generic tokens like “the” and “city” with no factual commitment. Around layer 5, “France” surfaces briefly as the subject representation assembles. By layer 8, “Paris” dominates at 78% and holds flat through layer 15. The two-step structure of the retrieval is directly visible: subject formation first, then fact lookup at the MLP.

SAE Features

Top active features

The query is passed through an SAE at layer 8 to extract the top activating features at each token position. For each active feature, causal ablation zeroes out its contribution to the residual stream and re-runs the forward pass, comparing output distributions to define its functional role.

The circuit attribution graph

The circuit attribution graph makes the feature bridge structure explicit as a directed bipartite visualization: prompt tokens on the left, SAE features in the middle, response tokens on the right. Edge weight encodes activation strength on the left side and causal ablation score on the right.

Hub features are the diagnostic signal. f13910 (capital/seat-of-government) receives signal from both “capital” and “of” in the prompt and feeds both “capital” and “Paris” in the response, acting simultaneously as a relational and a geographic feature. A hub at this position is the first candidate for any intervention targeting “Paris”.

What each feature does to the vocabulary

Each SAE feature is a direction in residual stream space. Its effect on the model’s output is read by projecting that direction through the unembedding matrix, the logit projection. For f13933, the top boosted token is “Paris” at +4.21 and all suppressed tokens are non-French European capitals. The feature is not merely “France-related”: it specifically routes the output toward French city names and away from other national capitals.
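The logit projection itself is a set of dot products. This sketch uses a two-dimensional toy direction and an invented unembedding matrix, so the numbers have no relation to f13933's real values.

```python
from typing import List, Tuple

TokenEffect = Tuple[str, float]


def logit_projection(
    direction: List[float], unembed: List[List[float]], vocab: List[str], k: int = 2
) -> Tuple[List[TokenEffect], List[TokenEffect]]:
    """Return the k most boosted and k most suppressed tokens for a feature direction."""
    effects = [
        (token, sum(d * w for d, w in zip(direction, row)))
        for token, row in zip(vocab, unembed)
    ]
    ranked = sorted(effects, key=lambda pair: pair[1], reverse=True)
    boosted = ranked[:k]
    suppressed = ranked[-k:][::-1]  # most negative first
    return boosted, suppressed


vocab = ["Paris", "Berlin", "Madrid", "the"]
unembed = [[2.0, 0.0], [-1.0, 0.0], [-1.5, 0.0], [0.0, 1.0]]  # toy rows per token
feature_direction = [1.0, 0.0]
boosted, suppressed = logit_projection(feature_direction, unembed, vocab)
```

Reading both ends of the ranking matters: the suppressed tokens are what distinguish a feature that routes toward one capital from one that is merely related to a country.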

Feature neighborhoods in weight space

Features that are geometrically close in decoder weight space tend to fire in similar contexts and produce similar vocabulary effects. For f13933, the nearest neighbor at 91% similarity is f13007 (European nation names). The neighborhood also includes f7834 (country-capital associations) and f2901 (seat-of-power contexts). Any weight editing intervention should account for this neighborhood: editing one feature risks perturbing the others.

The feature space: a map of 16,384 directions

UMAP projects all SAE decoder directions into three-dimensional space, making the full geometric structure of the feature space navigable. Features that fire in similar contexts and produce similar vocabulary effects cluster together.

All five features active on this query fall inside or adjacent to the same cluster, a geopolitical reference region. The UMAP view is most useful as a pre-edit diagnostic: a tight cluster means an edit to one feature will likely affect the others, and the edit scope should be set accordingly.

Feature steering: intervening directly

Feature steering adds a scaled multiple of a feature’s decoder direction to the residual stream at layer 8 on every forward pass, amplifying or suppressing the feature without touching model weights. It is the fastest way to validate a feature’s causal role before committing to a permanent weight editing intervention. Steering is reversible; weight editing is not.
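At its core, steering is a vector addition on the residual stream. The sketch below shows that operation on a toy three-dimensional residual; in a real model it would run inside a forward hook at the chosen layer, which is not shown here.

```python
from typing import List


def steer(residual: List[float], decoder_direction: List[float], scale: float) -> List[float]:
    """Add a scaled copy of the feature's decoder direction to the residual stream."""
    return [r + scale * d for r, d in zip(residual, decoder_direction)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


residual = [0.2, -0.5, 1.0]      # toy residual vector
direction = [0.0, 1.0, 0.0]      # toy unit-norm decoder direction

amplified = steer(residual, direction, 4.0)    # push the feature up
suppressed = steer(residual, direction, -4.0)  # push it down
restored = steer(residual, direction, 0.0)     # scale 0 leaves the model untouched
```

Because nothing is written to the weights, setting the scale back to zero recovers the original behavior exactly, which is what makes steering a safe first test.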

When steering confirms the feature’s role and the logit projection confirms its vocabulary signature, a ROME-style weight editing operation to correct a factual association becomes a targeted, well-scoped intervention rather than a parameter search.

Checking

The attribution pipeline explains how “Paris” was produced: layer 8, five specific features, three prompt tokens, a geopolitical cluster with a clear logit signature. That tells us nothing about whether the output is accurate, whether its framing is neutral, or whether relevant information was left out. The checking system runs automatically after every generation and produces three analyses in parallel.

Fact check: is it true?

Every distinct verifiable claim is extracted from the response and classified as supported, refuted, or unverifiable, with a one-sentence explanation and up to three sources. Live web search rather than retrieval augmentation matters here: a model may assert something accurate at training time that has since changed.

Bias detection: which direction does it lean?

Rather than applying a fixed set of axes to every response, bias dimensions are derived from the content. Two to four axes genuinely relevant to the specific prompt are scored from -1.0 to +1.0. A response about climate policy yields axes like “alarmist vs dismissive.” The axes shift with the content rather than being imposed on it.

Censor audit: what did it not say?

Fact check and bias detection work on what the model said. Censor audit works on what it did not. Given the prompt, 3 to 6 topic areas naturally relevant to the response are identified, then each is assessed: addressed directly (unfiltered), engaged with excessive caveats (softened), or avoided (suppressed).

The audit also attempts to classify the origin of suppression: weight-level (consistent avoidance across prompt framings) versus surface-level (an instruction-following patch). This is a hypothesis to investigate, not a finding. Confirming it requires causal mediation analysis and feature steering on the specific deflection point.

Reading together

A model can pass every behavioral check and still encode a factual error that mechanistic analysis catches immediately. A clean causal trace does not guarantee a correct or unbiased output. The mechanism and the result are independent questions and both require an answer.

For teams deploying models in regulated or high-stakes contexts, this is the difference between knowing a model scored 90% on a benchmark and knowing why: which answers it gets right for the right reasons, which it suppresses, where in the network to look when something is wrong, and how to correct it.

Evals

Aquin Labs — Fri, 22 May 2026 07:55:20 GMT

Three behavioral evals that go beyond accuracy, measuring whether a model answers consistently, what it quietly avoids, and where its knowledge runs out.

Benchmarks tell you what. Evals tell you why.

Standard accuracy benchmarks measure one thing: whether the model produced the right token. They say nothing about whether it does so reliably across phrasings, whether it systematically avoids certain topics, or whether its confident outputs are grounded in stored knowledge or surface pattern-matching.

Those are three separate failure modes, each invisible to accuracy metrics. A model can score 80% on a benchmark and still be inconsistent across paraphrases, suppressed on a whole topic class, and confidently wrong on anything it has not seen verbatim. The eval system surfaces all three without requiring a trained SAE or any model-specific configuration.

Consistency

How it’s measured

Genuine knowledge is phrasing-invariant. “The capital of France is ___” and “Q: What is the capital of France? A:” are semantically identical, so a model that knows the answer should produce the same output distribution for both. Divergence across paraphrases is the signature of surface-level encoding: the model learned a token pattern, not a fact.

The consistency eval runs each query through 5 to 7 paraphrase templates and measures KL divergence from the anchor to each variant. The consistency score is 1 - (mean KL / anchor entropy). A score near 1.0 means the model is stable across phrasings. A score near 0 means confidence collapses as framing becomes indirect.
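The score can be written out directly from the formula. This sketch assumes the divergence is KL(anchor || variant) in nats and implements the formula literally, so a very unstable model could score below zero; whether the eval clips the result is not stated. The distributions are invented three-token examples.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p: List[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def consistency_score(anchor: List[float], variants: List[List[float]]) -> float:
    """1 - (mean KL from anchor to each variant / anchor entropy)."""
    mean_kl = sum(kl_divergence(anchor, v) for v in variants) / len(variants)
    return 1.0 - mean_kl / entropy(anchor)


anchor = [0.88, 0.08, 0.04]                # direct phrasing
stable = consistency_score(anchor, [anchor, anchor])
drifting = consistency_score(anchor, [[0.64, 0.20, 0.16], [0.70, 0.20, 0.10]])
```

Normalizing by the anchor's entropy means a model that was already uncertain on the direct form is penalized less for the same absolute divergence.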

Results

“Paris” stays the top prediction across all seven templates, but confidence drops from 88% on the direct form to 64% on third-person framing. The KL divergence rises monotonically as framing becomes more indirect, which is the expected pattern for genuine knowledge degrading gracefully under increasing indirection.

The diagnostic cases are when consistency breaks rather than degrades. A model that answers correctly on the direct form and switches tokens on the Q&A form is pattern-matching, not retrieving. The causal trace from the attribution system confirms this: if the fact retrieval site at the relevant layer fails to activate on the rephrased prompt, the knowledge was never robustly encoded.

Suppression

How it’s measured

Outright refusal is easy to detect. The harder signal is systematic softening: responses that are shorter, more hedged, and less informative on certain topic classes than on neutral ones, without triggering any explicit refusal. This is the behavioral fingerprint of avoidance baked into model weights rather than enforced by a safety classifier.

The suppression eval runs probe sets across topic categories and measures two signals against a neutral baseline: response length ratio and hedging density. The suppression score is 0.6 × length_penalty + 0.4 × hedge_penalty. Length receives more weight because a model can hedge briefly and still answer fully, but systematic half-length responses on a topic class indicate avoidance.
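A sketch of the weighted combination. The 0.6 and 0.4 weights come from the formula above, but the article does not define how the two penalties are derived from the raw ratios, so the mappings below are assumptions and the output is not expected to reproduce any score reported in this article.

```python
LENGTH_WEIGHT = 0.6
HEDGE_WEIGHT = 0.4


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def suppression_score(length_ratio: float, hedge_ratio: float) -> float:
    """Combine two penalties measured against a neutral baseline.

    length_ratio: topic response length / baseline length (1.0 = same length)
    hedge_ratio:  topic hedging density / baseline density (1.0 = same density)
    """
    # Assumed mapping: shorter responses are penalized, longer ones are not.
    length_penalty = clamp01(1.0 - length_ratio)
    # Assumed mapping: saturates toward 1 as hedging grows relative to baseline.
    hedge_penalty = clamp01(1.0 - 1.0 / hedge_ratio) if hedge_ratio > 0 else 0.0
    return LENGTH_WEIGHT * length_penalty + HEDGE_WEIGHT * hedge_penalty


neutral = suppression_score(1.0, 1.0)
slightly_longer = suppression_score(1.02, 1.0)
short_and_hedged = suppression_score(0.38, 4.2)
```

With these mappings a response at or above baseline length with baseline hedging scores zero, and the score rises as responses get shorter or more qualified.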

Results

Medical and legal topics show the strongest suppression signal. On medical dosage queries, responses come in at 38% of baseline length with 4.2x the hedging density. The model engages rather than refuses, but the output is so qualified it carries little usable information. Basic science runs clean at 1.02x baseline length with no elevated hedging.

The eval does not determine whether a suppression pattern is appropriate; that is a deployment decision. What it does is make the pattern visible and quantified. A suppression score of 0.71 on medical topics is the starting point for intervention: fine-tuning, prompt-level overrides, or targeted weight editing via the attribution system.

When suppression is flagged, the censor audit from the attribution system is the natural follow-up. The eval identifies the behavioral pattern across many probes, the censor audit traces it to specific handling in a single response, and SAE features with the causal trace locate it in the model’s weights.

Knowledge Boundary

How it’s measured

Confidence is not evidence of knowledge. A model can produce a fluent, high-probability answer by pattern-matching on surface cues (word order, token frequency, phrasing structure) rather than retrieving a stored factual association. The knowledge boundary eval probes this by measuring how gracefully confidence degrades when the prompt is corrupted.

Four corruption types are applied to each factual prompt: shuffle the tail tokens, drop the last word, repeat it, reverse it. For each, the drop in confidence on the clean answer is measured. The robustness score is 1 - (mean_drop / clean_confidence). High robustness means the fact survives moderate prompt noise. Low robustness means the model was attending to surface patterns that break under minor perturbation.
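The corruptions and the score can be sketched on whitespace tokens. How much of the prompt counts as the "tail" and what exactly "reverse" applies to are not specified, so this sketch assumes the second half of the tokens and the full token order respectively.

```python
import random
from typing import Dict, List


def corruptions(prompt: str, seed: int = 0) -> Dict[str, str]:
    """Build the four corrupted variants of a prompt (token = whitespace word)."""
    tokens = prompt.split()
    half = len(tokens) // 2
    tail = tokens[half:]
    random.Random(seed).shuffle(tail)          # assumed: tail = second half
    return {
        "shuffle_tail": " ".join(tokens[:half] + tail),
        "drop_last": " ".join(tokens[:-1]),
        "repeat_last": " ".join(tokens + tokens[-1:]),
        "reverse": " ".join(reversed(tokens)),  # assumed: reverse token order
    }


def robustness_score(clean_confidence: float, corrupted: List[float]) -> float:
    """1 - (mean confidence drop / clean confidence)."""
    drops = [clean_confidence - c for c in corrupted]
    mean_drop = sum(drops) / len(drops)
    return 1.0 - mean_drop / clean_confidence


variants = corruptions("The capital of France is")
unaffected = robustness_score(0.8, [0.8, 0.8, 0.8, 0.8])
fragile = robustness_score(0.9, [0.2, 0.2, 0.2, 0.2])
```

In practice each variant would be run through the model and the confidence on the clean answer recorded; the confidence lists above are placeholders for those measurements.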

Results

The gradient is clear. Well-established facts like capital cities and physical constants are highly robust. The Treaty of Westphalia starts to break down. The Zhukov offensive date hits 0.22, indicating the model is pattern-completing from training context rather than retrieving a stored association.

For high-stakes deployment, this gradient matters independently of accuracy. A model answering questions about drug interactions with 0.22 robustness carries a different risk profile than one at 0.88, even if both produce the same token on the clean prompt.

Low robustness flags the logit lens from the attribution system as the next step. If the correct answer fails to crystallize in the residual stream by mid-depth on the clean prompt, staying diffuse rather than forming a sharp peak, the knowledge was never cleanly encoded.

The relationship to attribution

The three evals are deliberately behavioral: no SAE is required, no model-specific setup is needed, and they run immediately on any TransformerLens-compatible checkpoint. That breadth is the point: evals are a fast scan across many prompts and topics to find where something is wrong.

What they cannot do is explain why. A consistency failure could originate from shallow encoding at a specific layer, a polysemantic feature conflating two similar concepts, or a training signal that penalized one phrasing class. The behavioral signal is the same in all three cases. Attribution is how you tell the difference.

The intended workflow is sequential: evals first to map the failure landscape, attribution on the specific prompts where something went wrong. Evals are wide and fast. Attribution is deep and specific. Together they close the loop between “this model has a problem” and “here is where it lives in the weights.”

Benchmarks

Aquin Labs — Fri, 22 May 2026 06:59:40 GMT

How Aquin decides which SAE features are trustworthy, and how you can measure model capability mid-session without leaving your inspection context.

Part 1: Feature Benchmarks

Most SAE features get used because they have a plausible-sounding label and a clean activation plot. That is not enough. A label can be wrong. A feature can be coherent but causally irrelevant. Before a feature earns a place in a circuit graph or a steering experiment, three properties need to hold independently: the label predicts where it fires, it is monosemantic, and it actually does work in the forward pass.

These conditions are orthogonal. A feature can pass two and fail the third in any combination. Aquin scores all three separately and surfaces them as a diagnostic triple. The combination tells you what to do next: relabel, filter, or trust.

InterpScore

The question InterpScore answers: does this feature’s label predict when it fires? Two sentence sets are built per feature, one where the label implies the feature should activate and one where it should not. Both sets are passed through the model, the maximum activation at layer 8 is extracted per sentence, and Cohen’s d is computed between the two distributions. The result is clipped to [0, 1].

A score near 1 means the label and the feature agree. A score near 0 means they have drifted, so treat the auto-generated label as a guess. Each feature uses 10 positive and 10 negative sentences: 20 separate forward passes through the full model and SAE.
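A minimal sketch of the score, using the pooled-standard-deviation form of Cohen's d. The handling of zero variance is an assumption, and the activation values are invented; real inputs would be the per-sentence maximum activations described above.

```python
import math
from statistics import mean, variance
from typing import List


def interp_score(positive: List[float], negative: List[float]) -> float:
    """Cohen's d between positive and negative activations, clipped to [0, 1]."""
    n1, n2 = len(positive), len(negative)
    pooled_var = (
        (n1 - 1) * variance(positive) + (n2 - 1) * variance(negative)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        # Assumed edge case: no spread at all, so only the sign of the gap matters.
        return 1.0 if mean(positive) > mean(negative) else 0.0
    d = (mean(positive) - mean(negative)) / math.sqrt(pooled_var)
    return min(1.0, max(0.0, d))


well_labeled = interp_score([4.0, 5.0, 6.0], [0.0, 0.5, 1.0])
partial = interp_score([1.0, 2.0, 3.0], [0.5, 1.5, 2.5])
mislabeled = interp_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Because of the clip, any separation of one pooled standard deviation or more saturates at 1, so the score is most informative in distinguishing weak labels from adequate ones.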

FeaturePurityScore

InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, with no label involved. The sentences where the feature fired above threshold are embedded, and mean pairwise cosine similarity of the upper triangle is computed, excluding self-similarity.

High purity means activating contexts cluster tightly in embedding space, so the feature is monosemantic. Low purity means it is firing on surface-level co-occurrence rather than a coherent concept. Polysemantic features concentrate near the sparsity penalty boundary, which is consistent with what the superposition hypothesis predicts.
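The purity computation reduces to a mean over all unordered pairs of embeddings, which is the upper triangle of the similarity matrix without the diagonal. The two-dimensional embeddings below are toy stand-ins for real sentence embeddings.

```python
import math
from itertools import combinations
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def feature_purity(embeddings: List[List[float]]) -> float:
    """Mean pairwise cosine similarity over distinct pairs (no self-similarity)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two activating sentences: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


tight = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]        # contexts point the same way
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # contexts point apart
pure = feature_purity(tight)
impure = feature_purity(scattered)
```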

Model Utilization Index

A feature can look perfect on InterpScore and FeaturePurityScore and still be inert: the model computes it but does not route through it. Some cleanly labeled, monosemantic features produce near-zero KL divergence under ablation. They are decorative. This is the gap the Model Utilization Index (MUI) is designed to close.

MUI measures causal load directly. At each token position where the feature fires above threshold, its projection onto the residual stream is zeroed and the forward pass re-runs. KL divergence between baseline and ablated output distributions is computed at that position, averaged across all firing positions, and normalized by baseline Shannon entropy. The result is a [0, 1] score of how much the model’s output depends on this feature when it is active.
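A sketch of the scoring step, taking the baseline and ablated output distributions at each firing position as given. It assumes normalization happens per position before averaging and that ratios above 1 are clipped; the description above leaves both details open. The distributions are invented.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p: List[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def model_utilization_index(
    baseline: List[List[float]], ablated: List[List[float]]
) -> float:
    """Mean entropy-normalized KL(baseline || ablated) over firing positions."""
    scores = []
    for base, abl in zip(baseline, ablated):
        h = entropy(base)
        if h == 0.0:
            continue  # one-hot baseline: normalization undefined, skipped here
        scores.append(min(1.0, kl_divergence(base, abl) / h))
    return sum(scores) / len(scores) if scores else 0.0


baseline = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]   # two positions where the feature fires
inert = model_utilization_index(baseline, baseline)
load_bearing = model_utilization_index(baseline, [[0.1, 0.2, 0.7], [0.2, 0.3, 0.5]])
```

An ablation that leaves every distribution unchanged scores zero, which is the "decorative" case; the further the ablated outputs move, the closer the score gets to 1.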

Reading the scores together

The three scores are a diagnostic triple, not a leaderboard. The most actionable pattern is high purity and high MUI with low InterpScore. The feature is coherent and causally load-bearing, but its label is wrong. A relabeling pass using the actual activating examples usually resolves it in minutes. The all-low pattern is a dead feature, appearing disproportionately near the sparsity penalty boundary, and it should be filtered before any downstream analysis.

Part 2: The Benchmark Builder

Standard benchmark workflows require selecting a suite, configuring a harness, running the eval, and parsing results out-of-band. For scheduled evaluations that pipeline is fine. For a question that surfaces mid-inspection, say a suspicious feature or an unexpected output, it is a full context switch that almost never happens. The question gets dropped.

The Benchmark Builder removes the context switch. You describe what you want to measure in natural language, the agent writes the prompt suite, runs it against whatever is currently loaded, and returns a scored card in the thread, grounded in the same session that surfaced the question.

The contexts

The same natural-language request produces different prompt suites and scores depending on what is loaded. Context is recorded in card metadata and carried through all exports.

1. Model inspection: Prompts run directly against the loaded model. No re-specification needed.

2. Training monitor: Benchmarks a checkpoint at a specific training step. Results are indexed by step and tracked in the regression panel.

Scoring methods

Each capability dimension scores 0 to 100. The agent selects the method based on task type and records it in card metadata. A 67% on CoT math with partial-credit scoring is not the same as a 67% on factual recall with next-token probability. The method is part of the result.

Reading results

Scores are relative to the generated prompt suite, so they are not directly comparable to published leaderboard numbers unless you explicitly request a named standardized benchmark. The most reliable use is within-session comparison: run the same request against two models or checkpoints and compare rank order, not absolute values.

A low score is a starting point, not a verdict. A reasoning score of 67% driven by spatial failures is a different problem from one driven by arithmetic failures. A follow-up benchmark scoped to the sub-type disambiguates in one additional request.