
Postmortem: when the eval suite missed a regression for six weeks.

A subtle regression in long-context handling shipped six weeks before a routine editorial review caught it. Here is what the eval suite was measuring, what it should have been measuring, and the rebuild that closed the gap.

By D. Cho, Staff Engineer · Knyte
Published: January 29, 2026
Read time: 11 min
Category: Postmortems

On January 22, 2026, an editorial-review session surfaced a regression in our long-context handling that turned out to have been live for six weeks. The model had been silently degrading on a specific class of long-context input — documents over roughly twelve thousand tokens with multiple distinct topical segments — across two model promotions in early December and mid-January. The eval suite had been green for both promotions because the suite's long-context cases happened to be either much shorter or topically uniform. The regression was real. The eval coverage gap was the actual root cause. We rebuilt the long-context section of the suite, identified eleven downstream incidents that had been routed to manual handling without escalation, and tightened the suite's coverage discipline. We are publishing the postmortem because the coverage-gap pattern is the most common eval-suite failure mode, and the rebuild approach is replicable.

Timeline.

  • December 4, 2025 — Model promotion (v3.0). Eval suite green. New model serves production.
  • December 4 – January 22 — Production traffic steady. Eleven editor-flag incidents on long-context, multi-topic documents. Each is routed to manual handling and resolved at the case level. No escalation pattern is detected because the manual-handling rate stays within the noise floor of the operations dashboard.
  • January 14, 2026 — Model promotion (v3.1). Eval suite green. New model serves production.
  • January 22, 09:14 UTC — Editorial team holds a routine quarterly quality review. The review pulls a stratified sample of the trailing eight weeks of generations. The reviewer notices a pattern: drafts produced from inputs over twelve thousand tokens with multiple topical segments are noticeably weaker than drafts from the same length range with a single topic.
  • January 22, 11:30 UTC — Engineering team confirms the pattern in the production telemetry. The eleven incidents previously handled as individual cases are revealed to be the same regression.
  • January 22, 14:18 UTC — Engineering identifies the eval-suite gap. The long-context test cases in the suite are either short (under 4k tokens) or topically uniform. There are no cases that match the production failure profile.
  • January 23 – January 26 — Eval suite long-context section rebuilt with twenty new test cases drawn from the production failure pattern. Both v3.0 and v3.1 are re-evaluated against the rebuilt suite.
  • January 26 — Both v3.0 and v3.1 fail the new long-context section. v2.4 (the model running before December 4) passes. Regression confirmed at v3.0.
  • January 27 — Decision: roll back to v2.4 and re-derive v3.x improvements without the long-context regression. The rollback is operationally safe because the model is held warm and the feature surface has not depended on v3.x-specific behavior.
  • January 29 — v2.4 in production. Affected workflow output volume normalizes within 24 hours.

Root cause.

Two model promotions (v3.0 in December and v3.1 in January) included training data and architecture changes that subtly degraded the model's handling of long, multi-topic inputs. The eval suite did not catch the degradation because the suite's long-context cases were not stratified to include the multi-topic profile. The cases that exercised long context were topically uniform; the cases that exercised multi-topic handling were short. Neither case profile triggered the failure mode.

Production traffic, by contrast, included a meaningful fraction of long, multi-topic inputs because that is what enterprise editorial workflows actually look like. The model failed on this fraction of traffic, and the editor-in-the-loop layer caught the failures and resolved them at the case level. Each individual case was handled correctly. The pattern across cases was not detected because the operations dashboard's noise floor was higher than the regression's signal floor.

What worked.

Editor-in-the-loop caught every individual case. No bad output reached a downstream system. The customer-facing impact was zero across the six-week window. The runtime behaved correctly at the case level even though the model was degraded.

The quarterly editorial review surfaced the pattern. The review is a deliberate practice: pull a stratified sample of recent outputs and look for patterns the operations dashboard might be missing. The pattern surfaced precisely because the review was looking for it. Without the review, the regression might have continued for another quarter.

The rollback path was safe. Holding the previous model warm allowed a low-risk rollback. The runtime config flip executed in under two minutes. No re-platforming was required.

What did not work.

The eval suite's long-context coverage was incomplete. The suite measured what it had been built to measure. The coverage gap meant the suite reported green on a model that had a real regression. This is the structural failure mode we wrote about in the eval-suite dispatch, and this incident is a textbook example.

The operations dashboard's noise floor was higher than the regression's signal. Eleven extra manual-handling cases over six weeks did not exceed the dashboard's escalation threshold. The dashboard was tuned to detect spikes, not slow drifts. The regression was a slow drift.

Test-case stratification was implicit, not explicit. The eval suite had test cases that covered "long context" and "multi-topic" as separate dimensions. The suite did not enforce coverage of the intersection. The intersection is where the regression lived.
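
To make the gap concrete, here is a minimal sketch of what enforcing the intersection could look like. The dimension names, bucket labels, and minimum-per-cell threshold are hypothetical illustrations, not our suite's actual schema.

```python
from itertools import product

# Hypothetical dimensions and buckets; a real suite declares its own.
DIMENSIONS = {
    "length": ["short", "medium", "long"],   # e.g. <4k, 4k-12k, >12k tokens
    "topic_count": ["single", "multi"],
}

# Each eval case is tagged with one bucket per dimension.
cases = [
    {"id": "lc-001", "length": "long", "topic_count": "single"},
    {"id": "mt-014", "length": "short", "topic_count": "multi"},
    # ...
]

MIN_CASES_PER_CELL = 3

def coverage_gaps(cases, dimensions, min_per_cell):
    """Return every cell of the dimensional matrix with too few cases."""
    dim_names = list(dimensions)
    counts = {cell: 0 for cell in product(*(dimensions[d] for d in dim_names))}
    for case in cases:
        counts[tuple(case[d] for d in dim_names)] += 1
    return [dict(zip(dim_names, cell)) for cell, n in counts.items() if n < min_per_cell]

gaps = coverage_gaps(cases, DIMENSIONS, MIN_CASES_PER_CELL)
if gaps:
    # Under-covered cells block the suite from reporting green.
    raise SystemExit(f"eval suite blocked, under-covered cells: {gaps}")
```

In this incident, the empty cell was exactly the long, multi-topic intersection. Requiring a minimum number of cases per cell is what the stratification matrix in the process changes below formalizes.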

Process changes.

  1. Eval suite stratification matrix. Every workflow's eval section now declares the dimensions on which it expects the model to be robust (length, topic count, citation density, formality, etc.). The matrix requires test coverage on every cell, not just the marginals. Cells with insufficient coverage block the eval suite from reporting green.
  2. Production traffic sampling into the eval suite weekly. Each week, a stratified sample of production traffic is added to the eval suite as candidate cases; a sketch of this step follows the list. Cases that the editorial team marks as having quality concerns become eval cases automatically. The suite grows with the production distribution rather than calcifying around the original design.
  3. Slow-drift detector on operations metrics. The dashboard now runs a six-week trailing-window detector against per-workflow editor-handling rates. Slow drifts above a threshold escalate even if the absolute rates remain below the spike threshold.
  4. Quarterly editorial review now produces a stratified-sample report that is shared back to the eval-suite owners. The review is no longer just for catching regressions; it is also a feedback channel into the eval suite's coverage matrix.
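
One way the weekly sampling step in item 2 might be wired is sketched below. The record fields (timestamp, id) and the caller-supplied strata_key are hypothetical; the production pipeline's schema and storage are its own.

```python
import random
from datetime import datetime, timedelta, timezone

def weekly_candidates(traffic, editor_flagged_ids, strata_key, per_stratum=5, seed=0):
    """Pick last week's candidate eval cases from production traffic.

    Anything the editorial team flagged for quality is added unconditionally;
    the rest is sampled per stratum so the suite tracks the production distribution.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    recent = [r for r in traffic if r["timestamp"] >= cutoff]

    # Editor-flagged cases become eval cases automatically.
    flagged = [r for r in recent if r["id"] in editor_flagged_ids]

    # Group the rest by stratum so every cell of the distribution is represented.
    by_stratum = {}
    for r in recent:
        by_stratum.setdefault(strata_key(r), []).append(r)

    rng = random.Random(seed)
    sampled = []
    for records in by_stratum.values():
        sampled.extend(rng.sample(records, min(per_stratum, len(records))))

    return {"auto_add": flagged, "candidates": sampled}
```

A natural choice for strata_key is the same set of dimensions the coverage matrix declares, so the weekly sample exercises the same cells the suite is required to cover.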

What this incident demonstrated about the architecture.

Editor-in-the-loop bounds the impact of an undetected regression. The six-week window is, in any other architecture, a customer-facing quality crisis. In this architecture, the customer impact was zero because every affected output was caught at the editor layer. The regression was real and operationally costly — eleven cases of manual handling, weeks of degraded throughput on the affected workflows — and it was not catastrophic. The editor-in-the-loop default is what made the difference.

Eval suites need coverage discipline, not just case discipline. The number of eval cases is not the right measurement. The coverage of the dimensional matrix is. Adding more cases against the same coverage profile produces more confidence in what the suite already measures, not better confidence in what the suite misses.

Slow-drift detection is a separate operational discipline from incident detection. The operations dashboard was tuned for incidents. The regression was a drift. Drift detection requires a different statistical machinery and a different escalation policy.
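
As a sketch of what that machinery can look like, the check below compares a trailing six-week window of per-workflow editor-handling rates against the six weeks before it. The series shape, thresholds, and escalation payload are hypothetical, not the production detector's actual configuration.

```python
from statistics import mean

def drift_escalation(weekly_rates, window=6, drift_threshold=0.3, spike_threshold=0.15):
    """Escalate slow drifts that never trip the spike threshold.

    weekly_rates: per-workflow editor-handling rate, oldest to newest, one value per week.
    Compares the trailing `window` weeks against the `window` weeks before them.
    """
    if len(weekly_rates) < 2 * window:
        return None  # not enough history to form a baseline

    baseline = mean(weekly_rates[-2 * window:-window])
    trailing = mean(weekly_rates[-window:])

    spiked = any(rate > spike_threshold for rate in weekly_rates[-window:])
    # Relative increase of the trailing window over the baseline.
    drifted = baseline > 0 and (trailing - baseline) / baseline > drift_threshold

    if spiked:
        return {"kind": "spike", "trailing": trailing}
    if drifted:
        return {"kind": "slow_drift", "baseline": baseline, "trailing": trailing}
    return None
```

The particular statistic matters less than the shape of the check: the drift detector runs on a long window and escalates on relative change, while the spike detector runs on absolute levels over a short one.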

We publish this postmortem because the eval-coverage gap is the most common failure mode of mature eval suites. The suite that is sophisticated and trusted is the suite most likely to have this exact gap, because the team's confidence in the suite has crowded out the discipline of asking what the suite is not measuring. The dimensional matrix is a small piece of process that closes the structural gap.

D. Cho, Staff Engineer · Knyte

Built distributed retrieval at Pinterest and Databricks. Spends most days inside trace viewers and the rest writing about why your eval suite lies to you.
