Knowing When the Model Is Actually Right

Aditya Mittal — Thu, 30 Apr 2026 02:39:36 GMT

The first thing I built when I joined the team wasn’t the model. It was the eval harness.

That sounds like a contrarian take, but it isn’t. It’s just what the work demanded. The product ships coaching suggestions to sales agents while they’re on live calls. The model hears an utterance and tells the rep what to say next. When I joined, the system existed but the suggestions weren’t great, and the team wanted to make them better. The obvious move was to start training. The problem was that I had no way to tell whether anything I changed was actually an improvement.

That’s a worse problem than it sounds. Without an eval, every change becomes a coin flip. You ship something, the team thinks it feels better, you ship something else, somebody thinks it feels worse, and after a few weeks of this you’ve moved the model around but you have no idea whether you’ve moved it forward. The history of ML is full of teams that thought they were improving things and weren’t. I didn’t want to join that history.

So before I trained anything, I built an 8-dimension LLM-as-judge harness. Eight dimensions might sound like overkill, and I went back and forth on whether to start with fewer. The reason I landed on eight was that “is this suggestion good?” turns out to decompose into a lot of independent things, and if you collapse them all into one score, you lose the ability to tell which part of the model got better or worse on a given change.

What the eight dimensions are

The harness scores every suggestion on two axes. Four dimensions per axis.

Signal quality. Is the suggestion the right thing to say at this moment in the call? This breaks into relevance (is it on-topic), specificity (is it concrete enough to act on), stage awareness (does it match where the call is in the sales process), and methodology alignment (does it follow the technique our coaching is based on).

Delivery quality. Even if the suggestion is right, is it usable? This breaks into actionability (can the rep act on it without parsing), tone (does it match how the rep talks), timing (does it arrive when it matters), and non-repetition (is it different from what we just said).

The reason for two axes is that signal and delivery fail in different ways and have different fixes. A model can be saying the right things badly, or the wrong things smoothly. If you have one composite score, those two failure modes look identical from a number on a dashboard. With two axes, you can tell whether the next change should be in training data or in prompt structure.
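To make the two-axis point concrete, here is a minimal sketch. The dimension names come from the list above; the 1-to-5 scale, the unweighted averaging, and the function names are assumptions for illustration, not the harness's actual code.

```python
from statistics import mean

# Dimension names follow the post; the 1-5 scale is an assumption.
AXES = {
    "signal": ["relevance", "specificity", "stage_awareness", "methodology_alignment"],
    "delivery": ["actionability", "tone", "timing", "non_repetition"],
}

def axis_scores(scores: dict[str, float]) -> dict[str, float]:
    """Average the four dimension scores within each axis."""
    return {axis: mean(scores[d] for d in dims) for axis, dims in AXES.items()}

def composite(scores: dict[str, float]) -> float:
    """Unweighted mean over all eight dimensions (one possible composite)."""
    return mean(scores[d] for dims in AXES.values() for d in dims)

# Right things said badly: strong signal, weak delivery.
right_but_clumsy = {d: 5 for d in AXES["signal"]} | {d: 2 for d in AXES["delivery"]}
# Wrong things said smoothly: weak signal, strong delivery.
wrong_but_smooth = {d: 2 for d in AXES["signal"]} | {d: 5 for d in AXES["delivery"]}

print(composite(right_but_clumsy), composite(wrong_but_smooth))
print(axis_scores(right_but_clumsy), axis_scores(wrong_but_smooth))
```

Both made-up examples land on the same composite, and only the per-axis view separates them.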

I’m not going to pretend the dimensions are perfect. They’re a working hypothesis. The set we have now is the second version. The first had different definitions for stage awareness and tone, and I tightened them after watching the judge disagree with itself on edge cases. The dimensions will probably keep evolving. That’s fine. What matters is that we have some decomposition, and we use it consistently.

Why LLM-as-judge, and which LLM

The judge is a small, fast model from a different family than the candidate. There are a few decisions buried in that sentence.

First, why an LLM judge at all instead of, say, embedding similarity against a curated reference set. The honest answer is that the suggestions are too open-ended for reference comparison to work. There’s no single right answer at any given moment in a sales call. There are a lot of decent ones and a few great ones, and the difference is judgment, not exact wording. Embedding similarity rewards saying things that look like the reference, which in practice rewards saying nothing distinctive. An LLM judge can evaluate whether a suggestion makes sense given the call context, even if the wording is novel.

Second, why a different model family than the one being evaluated. Self-evaluation has a known bias problem. Models tend to rate their own outputs highly, even when they shouldn’t. Picking a judge from a different lineage sidesteps this. Different training, no shared confidence in the same answers.

Third, why a small fast model and not a larger one. The volume matters. We run the judge on every suggestion in our eval set, on every release, plus async on a sample of live calls for champion/challenger. A larger judge would have been more accurate on each individual judgment, but we’d have run it less often, which is the wrong tradeoff. Better to run a slightly noisier judge constantly than a perfect judge rarely.

Fourth, one prompt per dimension instead of one prompt that scores all eight at once. This was a real source of judge variance. When I tried scoring all dimensions in a single prompt, the judge would anchor on whichever dimension it weighted first and let that bleed into the others. One dimension per prompt is more API calls but the scores are independent, which is the whole point.
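A rough sketch of the one-prompt-per-dimension pattern follows. The rubric wording paraphrases the definitions above. The prompt template, the 1-to-5 reply format, and the `judge` callable (any function that takes a prompt string and returns the model's reply) are hypothetical stand-ins, not a real client API.

```python
import re
from typing import Callable

# One short rubric per dimension, paraphrased from the post.
RUBRICS = {
    "relevance": "Is the suggestion on-topic?",
    "specificity": "Is it concrete enough to act on?",
    "stage_awareness": "Does it match where the call is in the sales process?",
    "methodology_alignment": "Does it follow the coaching technique?",
    "actionability": "Can the rep act on it without parsing?",
    "tone": "Does it match how the rep talks?",
    "timing": "Does it arrive when it matters?",
    "non_repetition": "Is it different from what was just said?",
}

def build_prompt(dimension: str, context: str, suggestion: str) -> str:
    """Build a prompt that asks about exactly one dimension."""
    return (
        f"Grade this suggestion on one criterion only: {dimension}.\n"
        f"Criterion: {RUBRICS[dimension]}\n\n"
        f"Call context:\n{context}\n\n"
        f"Suggestion:\n{suggestion}\n\n"
        "Reply with a single integer from 1 (poor) to 5 (excellent)."
    )

def parse_score(reply: str) -> int:
    """Pull the first standalone digit 1-5 out of the judge's reply."""
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return int(match.group())

def score_suggestion(judge: Callable[[str], str], context: str, suggestion: str) -> dict[str, int]:
    """Make one independent judge call per dimension."""
    return {
        dim: parse_score(judge(build_prompt(dim, context, suggestion)))
        for dim in RUBRICS
    }

# Usage with a fake judge that records the prompts it receives.
seen_prompts: list[str] = []

def fake_judge(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "Score: 4"

scores = score_suggestion(fake_judge, "Customer asks about pricing.", "Ask about their budget.")
```

Because each prompt carries a single rubric, the judge never sees the other seven criteria while scoring, which is what keeps the scores from bleeding into each other.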

The deployment loop

The eval harness only earns its keep if it actually gates what ships. Two layers.

Offline gating before deployment. Every candidate model (fine-tune, prompt change, anything) has to beat the production baseline composite on a held-out eval set before it’s allowed near production. The eval set is a few hundred suggestions covering the full range of stages and call types. If a candidate doesn’t beat baseline, it doesn’t ship. Doesn’t matter how interesting the change was.
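The gate itself can be sketched in a few lines. The post does not say how the composite is computed, so the unweighted mean here is an assumption, and `min_margin` is a hypothetical knob for requiring more than a bare win.

```python
from statistics import mean

def composite(eval_scores: list[dict[str, float]]) -> float:
    """Mean over suggestions of the mean over dimensions (assumed composite)."""
    return mean(mean(s.values()) for s in eval_scores)

def passes_offline_gate(
    candidate: list[dict[str, float]],
    baseline: list[dict[str, float]],
    min_margin: float = 0.0,
) -> bool:
    """A candidate ships only if it strictly beats baseline by more than min_margin."""
    return composite(candidate) - composite(baseline) > min_margin

# Two dimensions and two suggestions for brevity; the numbers are made up.
baseline_scores = [{"relevance": 4, "tone": 3}, {"relevance": 3, "tone": 3}]
candidate_scores = [{"relevance": 4, "tone": 4}, {"relevance": 3, "tone": 3}]

print(passes_offline_gate(candidate_scores, baseline_scores))
```

The strict inequality means a candidate that merely ties the baseline does not pass.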

Live experiment after deployment. Once a candidate clears offline eval, it gets promoted to challenger and starts serving a small fraction of real traffic.

The wrinkle is the comparison itself. A naive split puts challenger on 10% and the incumbent on 90%, then averages their scores. That has a measurement problem. The two cohorts are different calls handled by different agents. If challenger’s cohort happens to draw harder customers, its score looks worse than it should. The difference comes from the cohort drawing tougher material, not from the model being worse.

The fix is to run the incumbent in shadow on the challenger cohort. Challenger generates what the rep sees. The incumbent generates a suggestion on the same call in parallel, and we throw its output away but score it anyway. Both models get judged on the same inputs, every call.
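Here is a small sketch of why the paired comparison matters. The record layout, the function names, and every number are made up for illustration. The only thing taken from the post is the idea that both models are scored on the same calls.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PairedScore:
    call_id: str
    challenger: float        # judged score of the suggestion the rep saw
    incumbent_shadow: float  # judged score of the discarded shadow suggestion

def paired_delta(pairs: list[PairedScore]) -> float:
    """Mean per-call difference, challenger minus incumbent, on identical inputs."""
    return mean(p.challenger - p.incumbent_shadow for p in pairs)

def win_rate(pairs: list[PairedScore]) -> float:
    """Fraction of calls where challenger scored strictly higher."""
    return sum(p.challenger > p.incumbent_shadow for p in pairs) / len(pairs)

# A challenger cohort that happened to draw hard calls (made-up numbers).
cohort = [
    PairedScore("call-1", 3.0, 2.5),
    PairedScore("call-2", 2.5, 2.0),
    PairedScore("call-3", 3.5, 3.0),
]
# Incumbent scores on the other, easier calls (made-up numbers).
incumbent_on_other_calls = [4.0, 4.5, 3.5]

naive_gap = mean(p.challenger for p in cohort) - mean(incumbent_on_other_calls)
print(naive_gap, paired_delta(cohort), win_rate(cohort))
```

In this toy data the naive split says challenger is a full point worse, while the paired view says it is half a point better on every call it actually handled.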

We score cohort calls async, not in the hot path. If challenger holds up under live distribution it gets promoted. If it regresses on real calls in ways offline eval didn’t catch, we roll back.

The reason for both layers is that offline eval and production eval test different things. Offline eval tests whether the model is better on the cases I curated. Production eval tests whether the model is better on the cases users actually generate. Those distributions overlap but aren’t identical, and the gap is where regressions hide.

What I learned from building this

Three things, in rough order of how surprised I was by them.

An eval harness pays for itself almost immediately. I expected building this to feel like overhead, work that delays the real work. It didn’t. The first time I shipped a change and the harness told me which dimensions had moved, I had information I’d never have gotten from feel. From that point forward, every conversation about model quality became more concrete. The team stopped saying “this feels better” and started saying “this is up on specificity but down on stage awareness.”

The judge model is part of the system. I underweighted this when I started. The judge has its own failure modes, its own biases, and its own drift over time. Treating it as a fixed measurement instrument was a mistake. I now think of the judge as another model in the system, with its own version history and its own evaluation needs. The harness now has a calibration loop built in. Every release I score a small sample of suggestions myself, and we track how well my ranking lines up with the judge’s. If the agreement starts drifting, that’s the cue to investigate before trusting any new scores.

Composite scores hide more than they reveal. The composite improvement we report is the headline number, and it’s true. But it’s not the most useful number. The dimension-level deltas are where the engineering work actually lives: what improved by how much, what regressed, what stayed flat. The composite is for stakeholders. The dimensions are for me.
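As an illustration of the dimension-level view, a sketch like the following turns two score profiles into the "what improved, what regressed, what stayed flat" summary. The `flat_band` threshold and the sample numbers are assumptions, not values from the harness.

```python
def dimension_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-dimension change, candidate minus baseline, rounded for readability."""
    return {dim: round(candidate[dim] - baseline[dim], 3) for dim in baseline}

def bucket(deltas: dict[str, float], flat_band: float = 0.05) -> dict[str, list[str]]:
    """Sort dimensions into improved, regressed, and flat."""
    report: dict[str, list[str]] = {"improved": [], "regressed": [], "flat": []}
    for dim, delta in deltas.items():
        if delta > flat_band:
            report["improved"].append(dim)
        elif delta < -flat_band:
            report["regressed"].append(dim)
        else:
            report["flat"].append(dim)
    return report

# Made-up mean scores for three of the eight dimensions.
baseline_means = {"specificity": 3.2, "stage_awareness": 4.0, "tone": 4.1}
candidate_means = {"specificity": 3.8, "stage_awareness": 3.5, "tone": 4.1}

deltas = dimension_deltas(baseline_means, candidate_means)
report = bucket(deltas)
print(deltas, report)
```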

What I’d do differently next time

Three things, all of which are now built into the design from day one rather than bolted on later.

Calibrate the judge against humans before you trust it. I built the first version, started trusting the scores immediately, and didn’t run a calibration pass for months. The judge was close enough that the harness was useful, but there were specific dimensions where it was off in ways I only caught after the fact. The current design has the calibration loop running from the first release. A small sample, scored by me alongside the judge, with the rank correlation tracked over time.
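The rank-correlation check can be sketched with the standard library alone. This computes Spearman's rank correlation as the Pearson correlation of average ranks. The choice of Spearman specifically and the sample scores are assumptions, since the post only says "rank correlation".

```python
from statistics import mean

def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of the ranks. Undefined if either input is constant."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up scores for five suggestions on one dimension.
human_scores = [5, 3, 4, 1, 2]
judge_scores = [4.5, 3.0, 4.0, 1.5, 2.0]

print(spearman(human_scores, judge_scores))
```

Tracking this number per dimension and per release gives a simple series to watch. A drop is the cue to look at the judge before trusting new scores.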

Version the eval set explicitly. The eval set evolves as the product covers new call types. Early on I didn’t version it carefully, which means some of the early scores aren’t directly comparable to current ones. The current design treats the eval set as a versioned artifact. Each set has a manifest covering which stages and call types it includes, plus a check that no eval transcripts leaked into the training data.
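One way to sketch a versioned manifest with a leakage check is shown below. The field names, the stage and call-type labels, and the hash-based comparison are all hypothetical. Exact-match hashing only catches transcripts that are identical after case and whitespace normalization, so it would miss paraphrased or partially overlapping leaks.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSetManifest:
    version: str
    stages: frozenset[str]
    call_types: frozenset[str]
    transcript_hashes: frozenset[str]

def transcript_hash(text: str) -> str:
    """Hash a transcript after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def leaked_transcripts(manifest: EvalSetManifest, training_transcripts: list[str]) -> set[str]:
    """Return hashes of eval transcripts that also appear in the training data."""
    training_hashes = {transcript_hash(t) for t in training_transcripts}
    return set(manifest.transcript_hashes & training_hashes)

# Made-up transcripts and labels.
eval_transcripts = ["Hello, thanks for calling", "Let's talk pricing"]
training_transcripts = ["Totally different call.", "hello,   THANKS for calling"]

manifest = EvalSetManifest(
    version="v2",
    stages=frozenset({"discovery", "closing"}),
    call_types=frozenset({"inbound"}),
    transcript_hashes=frozenset(transcript_hash(t) for t in eval_transcripts),
)

leaked = leaked_transcripts(manifest, training_transcripts)
print(len(leaked))
```

Storing the manifest version next to every score makes it explicit which results are directly comparable.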

Build the live experiment loop earlier. I built offline eval first and live scoring second, with months in between. The cases where the two layers disagree are the most interesting ones. Live scoring catches regressions offline eval misses, and offline eval catches regressions live scoring doesn’t have the volume to detect. The current design has both from the start.