
Postmortem: when our brand-fit index drifted across two model promotions.

The brand-fit index was the metric that proved our editorial calibration. Two consecutive model promotions each nudged it down by a hair. Both promotions passed eval. The cumulative drift was material. Here is the catch.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED · MARCH 21, 2026
READ TIME · 10 MIN
CATEGORY · POSTMORTEMS

On March 8, 2026, our brand-fit index — the metric that captures how well the model's outputs match the editorial team's calibrated voice — surfaced as having drifted downward by 0.07 over an eight-week window. The drift was distributed across two consecutive model promotions: the first dropped the index by 0.03, the second by 0.04. Both promotions passed our eval suite individually. The cumulative drift was material. Customers had quietly been absorbing slightly more correction work for several weeks; nobody escalated, because each individual draft stayed within the editors' tolerance. The cumulative pattern was not visible until our quarterly editorial review surfaced it.

Timeline.

  • January 14, 2026 — Model promotion (v3.1 → v3.2). Eval suite passes. Brand-fit index drops 0.03 in the eval, classified as within noise.
  • January 14 – March 8 — Production traffic continues. Editor correction rate ticks up slightly, attributed to natural variance.
  • February 11, 2026 — Model promotion (v3.2 → v3.3). Eval suite passes. Brand-fit index drops another 0.04, again classified as within noise.
  • February 11 – March 8 — Editor correction rate ticks up further. Still attributed to natural variance.
  • March 8, 09:00 UTC — Quarterly editorial review pulls a stratified sample of trailing-eight-weeks generations. Reviewer notes that the sample's brand-fit signal is materially weaker than the prior quarter's.
  • March 8, 11:30 UTC — Engineering confirms the drift in production telemetry. The brand-fit index is 0.07 below the December baseline. Cumulative drift is significant; per-promotion drift was within noise.
  • March 8, 14:00 UTC — Decision: not a rollback. The two promotions had real benefits beyond brand-fit. Rolling back would have lost those benefits. Instead, schedule a targeted brand-voice fine-tune cycle to recover the calibration.
  • March 9 – March 13 — Targeted fine-tune cycle runs against the editor team's latest accept/reject decisions, weighted toward voice calibration.
  • March 13 — Updated model (v3.3.1) promoted. Brand-fit index recovers to 0.02 below the December baseline, well within noise.
  • Total drift window: 8 weeks. Window during which cumulative drift was unrecognized: 8 weeks. Time from recognition to remediation: 5 days.

Root cause.

Two model promotions, each with a small drift on the brand-fit dimension. Each individual drift was within the eval suite's noise threshold. The cumulative effect of two consecutive drifts in the same direction exceeded any individual drift threshold but was not measured against any cumulative threshold.
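To make the arithmetic concrete, here is a minimal sketch of the failure mode. The drift figures are the ones from the timeline; the 0.05 per-promotion noise threshold is an assumed value for illustration, since the suite's actual threshold is not stated in this post.

```python
# Each promotion is checked against a per-promotion noise threshold;
# nothing checks the sum. The 0.05 threshold below is assumed for
# illustration -- the suite's actual value is not stated in this post.
PER_PROMOTION_NOISE = 0.05

drifts = [-0.03, -0.04]  # v3.1 -> v3.2, then v3.2 -> v3.3

# Both gates pass individually.
print(all(abs(d) <= PER_PROMOTION_NOISE for d in drifts))  # True

# The cumulative drift that no gate ever evaluated.
cumulative = sum(drifts)
print(round(cumulative, 2))                    # -0.07
print(abs(cumulative) <= PER_PROMOTION_NOISE)  # False
```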

The promotions were correct to ship. Each one delivered measurable improvements on dimensions other than brand-fit. The eval suite correctly identified that no single promotion produced a material brand-fit regression. What the eval suite did not catch was that two consecutive promotions in the same direction on the same dimension produced a regression worth treating as material.

What worked.

The quarterly editorial review caught the pattern. This is the same mechanism that caught the eval-coverage gap we wrote about previously. Quarterly review is the catch-net for what the per-promotion eval suite does not detect. The investment in the review pays back specifically on slow-drift incidents that the per-promotion gates do not bound.

The targeted fine-tune cycle recovered cleanly. The brand-fit calibration was recoverable through a fine-tune cycle weighted toward voice signal. The recovery took five days. The model retained the improvements from the prior two promotions on dimensions other than brand-fit. The recovery was net-positive on every dimension.

What did not work.

The eval suite tracked per-promotion drift, not cumulative drift. The drift threshold was correctly calibrated for the noise of any single promotion. It did not measure trailing-window drift across promotions. A series of small same-direction drifts could accumulate to a material regression without any individual drift triggering the threshold.

Editor correction-rate increase was attributed to variance. The signal was there in the operations dashboard: the editor team's correction rate had ticked up over the eight-week window. The dashboard surfaced the trend, but the team read it as natural variance because the per-week change was within the dashboard's noise band. The trailing-eight-weeks aggregate was outside the band; the dashboard did not surface trailing-eight-weeks aggregates prominently.
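A sketch of what the dashboard was structurally missing, assuming a simple weekly series: every week-on-week delta sits inside the noise band, while the trailing-eight-weeks delta does not. Both the series and the noise band value are invented for illustration.

```python
# Invented weekly editor correction rates over an eight-week window;
# the noise band value is likewise assumed for illustration.
weekly_rate = [0.110, 0.112, 0.111, 0.114, 0.113, 0.116, 0.117, 0.119]
NOISE_BAND = 0.004  # assumed per-week noise band

# Per-week view: every week-on-week delta is inside the band.
per_week = [b - a for a, b in zip(weekly_rate, weekly_rate[1:])]
print(all(abs(d) <= NOISE_BAND for d in per_week))  # True -- reads as variance

# Trailing-window view: the same series, aggregated, is outside the band.
trailing_delta = weekly_rate[-1] - weekly_rate[0]
print(round(trailing_delta, 3))           # 0.009
print(abs(trailing_delta) <= NOISE_BAND)  # False -- the slow drift
```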

Process changes.

  1. Cumulative-drift threshold added to the eval suite. Each editorial rubric now has a trailing-window drift threshold (set at 0.05 cumulative across any rolling four-promotion window). Promotions that would push the cumulative drift past the threshold trigger an architecture review even if the individual promotion is within noise (a sketch of this gate follows the list below).
  2. Operations dashboard adds trailing-window aggregates. The editor correction-rate metric now plots both per-week and trailing-eight-weeks. The trailing-window plot is given equal prominence; teams can see slow drifts as easily as week-on-week noise.
  3. Quarterly editorial review's role formalized in the eval architecture. The review is no longer just a quality check; it is a defined input to the eval suite, with the reviewer's stratified-sample observations feeding back into the cumulative-drift thresholds.
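A minimal sketch of the gate from change 1, assuming a promotion log that records each promotion's brand-fit delta. The `PromotionRecord` shape and `cumulative_drift_gate` name are hypothetical; the 0.05 threshold and four-promotion window are the values from the list above.

```python
from dataclasses import dataclass

CUMULATIVE_THRESHOLD = 0.05  # from change 1: max |cumulative drift| per window
WINDOW = 4                   # rolling four-promotion window

@dataclass
class PromotionRecord:  # hypothetical record shape
    version: str
    brand_fit_delta: float

def cumulative_drift_gate(history: list[PromotionRecord],
                          candidate_delta: float) -> bool:
    """True if the trailing window, including the candidate promotion,
    stays within the cumulative threshold; False means the promotion
    triggers an architecture review even if it passes per-promotion."""
    deltas = [p.brand_fit_delta for p in history] + [candidate_delta]
    return abs(sum(deltas[-WINDOW:])) <= CUMULATIVE_THRESHOLD

# The January promotion alone passes; the February one, checked
# cumulatively against the log, does not.
history = [PromotionRecord("v3.2", -0.03)]
print(cumulative_drift_gate([], -0.03))       # True
print(cumulative_drift_gate(history, -0.04))  # False -> review
```

The gate deliberately keys on the signed sum rather than the sum of magnitudes: a -0.03 followed by a +0.03 cancels out, while two same-direction drifts accumulate, which is exactly the failure mode this incident exposed.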

What this incident demonstrated.

The eval suite that we wrote about in the eval-suite dispatch has to measure not just per-promotion regression but cumulative drift across promotions. A series of small same-direction drifts is a different failure mode than a single large drift, and the per-promotion thresholds do not bound it. The trailing-window threshold is the bounded fix.

The deeper lesson is that operational dashboards have to surface slow drifts with the same prominence as per-window changes. Per-window changes are easy to see and easy to dismiss as noise; slow drifts are harder to see and more important when they are real. The dashboard that biases toward per-window prominence is structurally biased toward missing slow drifts. The architectural change — equal prominence for trailing aggregates — closes the bias.

If your AI deployment runs scheduled model promotions and your eval thresholds are per-promotion, the next slow-drift incident is not bounded by your current architecture. The cumulative threshold is a small process change with a meaningful safety property. Adding it to the eval suite is a one-week project. Operating without it has a known cost.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
