The first eval suite a team writes is almost always wrong. It is wrong in a specific and predictable way: it consists of synthetic test cases the team designed by trying to imagine what could go wrong, scored by a metric the team picked because it was easy to calculate. The suite passes consistently in the design review. It fails to predict the regressions that actually surface in production. Around week three, somebody points out that the eval suite has been green for the entire window in which two known incidents shipped. The suite gets blamed. It usually deserves to.
Eval suites that survive production traffic are different in three specific ways. They are seeded from production data, not from synthetic test cases. They are scored by editorial judgment, not by a single similarity metric. And they are versioned with the corpus, not the model. The combination produces a suite that catches the regressions that actually matter, instead of catching the regressions the suite designer thought to anticipate.
What follows is the eval architecture we run on every install, the failure modes we have stopped tolerating, and the specific instrumentation that makes the suite cheap enough to run on every model promotion.
Production-seeded test cases.
Synthetic test cases — the ones a designer writes by hand at the start of the project — capture about a third of the regression surface in the deployments we have measured. The other two-thirds is in the long tail of inputs the designer did not imagine. Production traffic is the only source of those.
We seed every eval suite with a stratified sample of production traffic. The stratification matters: we sample by workflow, by document type, by editor cohort, and by editorial outcome (accept, reject, edit, escalate). The result is a test set that reflects the actual distribution of inputs the deployment is being asked to handle, including the edges the designer did not anticipate.
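A minimal sketch of that sampling step, assuming production events arrive as dicts carrying the four stratification keys; the field names and the per-stratum quota are illustrative, not a fixed schema:

```python
import random
from collections import defaultdict

# Hypothetical record shape: each production event carries the four
# stratification keys plus the raw input/output we want to replay.
STRATA_KEYS = ("workflow", "doc_type", "editor_cohort", "outcome")

def stratified_sample(events, per_stratum=25, seed=0):
    """Sample up to `per_stratum` events from every observed stratum.

    Sampling within each stratum is uniform; strata the designer never
    imagined are represented automatically because they appear in traffic.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for e in events:
        buckets[tuple(e[k] for k in STRATA_KEYS)].append(e)
    sample = []
    for stratum, bucket in sorted(buckets.items()):
        rng.shuffle(bucket)
        sample.extend(bucket[:per_stratum])
    return sample

# A low-volume stratum contributes everything it has, so rare edges
# are not drowned out by high-volume workflows.
events = [
    {"workflow": "summarize", "doc_type": "contract", "editor_cohort": "legal",
     "outcome": "accept", "input": "...", "output": "..."},
    {"workflow": "summarize", "doc_type": "memo", "editor_cohort": "ops",
     "outcome": "reject", "input": "...", "output": "..."},
]
suite_cases = stratified_sample(events, per_stratum=25)
```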
The sample is refreshed quarterly. Old cases stay in the suite as a regression baseline; new cases get added to capture distribution drift. The suite grows over time, which is the point — it is a record of what the deployment has been asked to do, not a snapshot of what the designer thought it would be asked to do.
Editorial scoring, not similarity metrics.
The default scoring metric for most eval suites is some flavor of similarity — BLEU, ROUGE, embedding cosine — between the model's output and a reference output. These metrics are easy to calculate, easy to chart, and structurally bad at catching the regressions that matter. A model can score well on similarity to the reference and still produce an output the editor would reject. A model can score badly on similarity and produce an output the editor would accept.
We score with editorial judgment. Each test case has an editorial rubric — typically three to five criteria specific to the workflow — and the model's output is scored against the rubric by a human editor or, increasingly, by a different model that has been calibrated against editor decisions. The rubric is the truth. The similarity metric is a sanity check.
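A sketch of what rubric-based scoring can look like, with a stub judge standing in for the editor or calibrated model; the rubric criteria, the score scale, and the `judge` interface are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """Three to five workflow-specific criteria, each scored independently."""
    name: str
    criteria: list

@dataclass
class ScoredCase:
    case_id: str
    scores: dict = field(default_factory=dict)  # criterion -> 0.0..1.0

def score_case(case_id, output, rubric, judge):
    """Score one output against every rubric criterion.

    `judge` is whatever renders editorial judgment: a human editor's
    recorded decision, or a calibrated model standing in for one.
    """
    scored = ScoredCase(case_id=case_id)
    for criterion in rubric.criteria:
        scored.scores[criterion] = judge(output, criterion)
    return scored

# The rubric is the verdict; similarity survives only as a sanity check.
summarize_rubric = Rubric(
    name="summarize",
    criteria=["factual accuracy", "covers key obligations", "house style"],
)
scored = score_case("case-001", "draft summary ...", summarize_rubric,
                    judge=lambda out, criterion: 1.0)  # stub judge
```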
Editorial scoring is more expensive than similarity scoring. It is also the only scoring approach that correlates meaningfully with production outcomes. We have tracked the correlation across eight installs. Similarity scores correlate with editor accept rates at about 0.3. Rubric-based editorial scores correlate at 0.84.
Versioning the eval against the corpus.
Most eval failures we have investigated turn out to be corpus drift, not model drift. The model is the same. The corpus changed. The eval test case, which assumed a particular corpus state, no longer makes sense against the current corpus.
The fix is to version the eval suite against a corpus version. Every eval run is a tuple: (model version, corpus version, eval suite version). Regressions are attributed to the dimension that changed. If the eval suite version and corpus version are constant and the model regressed, the model regressed. If the corpus version changed, the regression may be a corpus issue, and the investigation goes there first.
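One way to make the tuple and the attribution order concrete; the version identifiers and the single aggregate score are simplifications for illustration (real runs keep per-rubric scores):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRun:
    model_version: str
    corpus_version: str
    suite_version: str
    score: float  # aggregate shown for brevity

def attribute_regression(previous: EvalRun, current: EvalRun) -> str:
    """Attribute a score drop to the dimension that changed.

    Corpus and suite changes are checked before blaming the model,
    because most regressions turn out to be corpus drift.
    """
    if current.score >= previous.score:
        return "no regression"
    if current.corpus_version != previous.corpus_version:
        return "investigate corpus first"
    if current.suite_version != previous.suite_version:
        return "investigate suite change first"
    if current.model_version != previous.model_version:
        return "model regression"
    return "same tuple, different score: investigate nondeterminism"

prev = EvalRun("m-14", "corpus-2024Q3", "suite-7", 0.81)
curr = EvalRun("m-14", "corpus-2024Q4", "suite-7", 0.74)
print(attribute_regression(prev, curr))  # -> investigate corpus first
```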
This requires the corpus to be versioned, which we wrote about in a separate dispatch. The eval architecture and the corpus architecture are coupled. You cannot have eval discipline against an unversioned corpus, because each eval run is then measuring against a different target.
Three failure modes we have stopped tolerating.
After a few install cycles, three eval-suite failure modes have become diagnostic. When we audit a deployment and find any of these, we know the suite is theater.
A green suite for sixty consecutive days. This is rarely a sign of a healthy deployment. It is usually a sign that the suite is not measuring anything sensitive enough to fluctuate. Real production deployments have noise; eval suites that show no noise are the ones that have decoupled from production.
Eval cases that the design team curated and never refreshed. A suite that looks the same in week thirty as it did in week one has lost its grounding in production traffic. Distribution drift will silently invalidate the suite.
Single aggregate score with no per-rubric breakdown. The suite reports a number; nobody knows which rubrics moved. Aggregate scores hide regressions in specific dimensions. Per-rubric reporting is the minimum viable instrumentation for catching what production actually cares about.
What the runtime looks like.
The eval pipeline runs on every model promotion candidate. The pipeline pulls the current production-seeded test set, executes against the candidate model and the current production model in parallel, scores both with the editorial rubrics, and produces a per-rubric delta. A promotion is approved only if no rubric regressed past the threshold and at least one improved beyond noise.
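A sketch of that promotion gate, assuming per-rubric mean scores for both models; the regression threshold and noise band values are illustrative defaults, not tuned numbers:

```python
def promotion_gate(prod_scores, cand_scores,
                   regression_threshold=0.02, noise_band=0.01):
    """Approve a candidate only if no rubric regressed past the threshold
    and at least one rubric improved beyond the noise band.

    Both arguments map rubric name -> mean score over the test set.
    """
    deltas = {r: cand_scores[r] - prod_scores[r] for r in prod_scores}
    regressed = [r for r, d in deltas.items() if d < -regression_threshold]
    improved = [r for r, d in deltas.items() if d > noise_band]
    approved = not regressed and bool(improved)
    return approved, deltas

approved, deltas = promotion_gate(
    prod_scores={"factual accuracy": 0.88, "house style": 0.91},
    cand_scores={"factual accuracy": 0.92, "house style": 0.90},
)
# `deltas` is the per-rubric breakdown; an aggregate alone would hide
# exactly the regressions this gate exists to catch.
```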
The pipeline is fast — five to twenty minutes for the typical install — because the test set is bounded (a few hundred to a few thousand cases) and the scoring is parallelized. The cost is dominated by the editorial scoring step, which is why model-based scoring calibrated against editor decisions has become the default. We use a separate, smaller eval suite scored entirely by humans to keep the model-based scorer honest, and we recalibrate quarterly.
Where to start if your current eval suite is theater.
If the eval suite you are running is producing reassuring numbers that have stopped catching what production cares about, the rebuild is a one-week project, not a quarter-long initiative. The shape:
- Pull a stratified sample of production traffic from the last thirty days.
- Score each sample case against three to five editorial rubrics specific to the workflow. Use editors for the first pass.
- Promote the scored cases to the eval suite, replacing the synthetic cases.
- Calibrate a model-based scorer against the editor scores. Validate that the calibration holds against a held-out set; a sketch of that check follows this list.
- Run the new suite against the current model. Whatever number it produces is the new baseline.
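For the calibration step, a minimal sketch of validating a model-based scorer against held-out editor scores; Pearson correlation and the 0.8 bar are illustrative choices, not a prescribed standard:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; enough to check scorer agreement."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def calibration_holds(editor_scores, model_scores, min_corr=0.8):
    """True if the model-based scorer agrees with editors on held-out cases.

    The two lists are parallel over the same held-out cases; 0.8 is an
    illustrative default, not a standard.
    """
    return pearson(editor_scores, model_scores) >= min_corr

# Held-out cases never seen during calibration keep the check honest.
editors = [0.9, 0.4, 0.7, 1.0, 0.2]
model = [0.85, 0.5, 0.65, 0.95, 0.3]
assert calibration_holds(editors, model)
```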
The number is usually lower than the synthetic-suite number was. That is not a regression. That is the suite measuring the right thing for the first time. From that baseline, every model promotion is a real test, and the regressions that surface in production stop being surprises.
Eval suites that survive production traffic are not more sophisticated than the ones that do not. They are seeded from production, scored by editorial judgment, and versioned with the corpus. The architecture is simple. The discipline is sustained. The combination is what separates a deployment that improves under measurement from a deployment that just looks like it is being measured.