On March 8, 2026, a scheduled fine-tune cycle produced a model that drifted noticeably toward an outlier voice across all subsequent generations. The drift was caught by the editorial team in the first hour of production traffic against the new model. We rolled back, identified the six training examples that had produced the drift, removed them from the cycle's input, re-ran, and promoted the corrected model just over five hours after the original promotion. Customer-facing impact was bounded by editor-in-the-loop catching the affected drafts before they shipped. The deeper consequence was a process gap in our fine-tune input curation that we have closed.
Timeline.
- 08:14 UTC — Scheduled brand-voice fine-tune cycle begins. Input set: 412 editor-accepted drafts from the prior two weeks.
- 08:53 UTC — Fine-tune completes. Eval suite runs against the new model. Editorial-rubric scores: tone +0.02, fact +0.01, length 0, brand-fit −0.01. No regressions exceed the promotion threshold.
- 09:01 UTC — New model promoted to production via runtime config flip. Old model held warm for fast rollback.
- 09:34 UTC — Editorial team flags first drift case in the operations channel. The draft uses a register noticeably more formal than the calibrated brand voice.
- 09:38 UTC — Second flag, separate workflow. The drift pattern matches the first.
- 09:42 UTC — Third flag. On-call engineer initiates rollback. Production traffic returns to the prior model within 90 seconds.
- 09:48 UTC — Engineering and editorial leads start root-cause investigation. Hypothesis: training input contamination.
- 11:20 UTC — Investigation identifies six drafts in the 412-input set that came from a single editor's batch the prior week, all with an unusually formal register that did not match the broader brand voice. The editor had been working on a specific subset of high-formality outputs for a regulatory-compliance project.
- 11:45 UTC — Six drafts removed from the input set. Fine-tune re-run with 406-input set.
- 12:38 UTC — Fine-tune re-run completes. Eval scores match baseline within noise. Brand-fit rubric specifically: −0.001 (no drift).
- 13:55 UTC — Editorial team manually validates the corrected model against twenty held-out cases. All twenty match the calibrated brand voice within editorial tolerance.
- 14:18 UTC — Corrected model promoted. Warm-rollback to baseline removed.
- Total elapsed time from original promotion to corrected promotion: 5 hours 17 minutes. Affected production drafts during the drift window: 38. Drafts that reached the editor queue: 38. Drafts that shipped to a downstream system: 0.
Root cause.
The training input set for the fine-tune was the 412 editor-accepted drafts from the trailing two weeks. Six of those drafts were from a single editor's batch on a regulatory-compliance project. The drafts were editor-accepted because they were correct for the regulatory context: formal, hedged, and conservative. They were unrepresentative of the broader brand voice, and the fine-tune cycle treated them with the same weight as the other 406 drafts.
The eval suite did not catch the drift because the brand-fit rubric was scored against a held-out set that did not include any high-formality regulatory drafts. The rubric measured the model's drift against the calibrated baseline, but the baseline did not include the dimension along which the drift was happening. The eval signal was within noise because the noise floor was, in this case, larger than the signal.
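To make that gap concrete, here is a toy version of the aggregate scoring. The numbers, the noise floor, and the scoring shape are made up for illustration; this is not our eval harness. The point is that a dimension absent from the held-out set contributes nothing to the aggregate, so the aggregate cannot move no matter how far the model drifts along it.

```python
# Illustrative only: made-up per-example brand-fit deltas (new model minus baseline),
# tagged by domain. None of these numbers are from the real eval run.
from collections import defaultdict
from statistics import mean

NOISE_FLOOR = 0.05  # hypothetical promotion threshold on the aggregate score

deltas = [
    ("general", -0.01), ("general", 0.02), ("general", -0.02), ("general", 0.01),
    ("executive", 0.00), ("executive", -0.01),
    # The held-out set contained no regulatory examples, so the dimension that
    # actually drifted contributes nothing to the aggregate.
]

aggregate = mean(d for _, d in deltas)
by_domain = defaultdict(list)
for domain, d in deltas:
    by_domain[domain].append(d)

print(f"aggregate delta: {aggregate:+.3f} (noise floor {NOISE_FLOOR})")
for domain, ds in by_domain.items():
    print(f"  {domain}: {mean(ds):+.3f}")
# An aggregate within noise says nothing about a dimension the held-out set never measures.
```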
What worked.
Editorial team flagged within 35 minutes. The first flag arrived through the operations channel 33 minutes after the new model went live. The drift was visible to humans before any automated metric caught it; the editorial team's calibration on the brand voice is more sensitive than any automated rubric we currently run.
Rollback in 90 seconds. The runtime config flip is fast. The prior model was back in production within a minute and a half of the rollback decision.
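For readers asking what the flip actually is: a hypothetical sketch of the mechanism follows, not our serving code. The class name, version ids, and routing shape are assumptions. The promoted version and the prior version are both loaded; promotion and rollback are a pointer swap on the routing config, which is why rollback takes seconds rather than a redeploy.

```python
# Hypothetical sketch of the promote / rollback flip. Our real serving stack differs;
# the point is only that rollback is a pointer swap over models that are already warm.
class ModelRouter:
    """Routes a serving alias to one of several already-loaded (warm) model versions."""

    def __init__(self, warm_models: dict[str, object], serving: str):
        self.warm_models = warm_models   # version id -> loaded model handle
        self.serving = serving           # version id currently behind the alias
        self.previous = None             # held for fast rollback

    def promote(self, version: str) -> None:
        if version not in self.warm_models:
            raise ValueError(f"{version} is not loaded; promotion requires a warm model")
        self.previous = self.serving
        self.serving = version           # the "config flip": new traffic uses this version

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.serving, self.previous = self.previous, self.serving

router = ModelRouter({"ft-2026-03-01": object(), "ft-2026-03-08": object()},
                     serving="ft-2026-03-01")
router.promote("ft-2026-03-08")    # 09:01 promotion
router.rollback()                  # 09:42 rollback: traffic back on the prior model
print(router.serving)              # -> ft-2026-03-01
```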
Editor-in-the-loop caught all 38 affected drafts. None of the drift-affected outputs reached a downstream system. The customer-facing impact was zero.
What did not work.
The eval suite missed the regression. The brand-fit rubric was within noise on the affected dimension because the held-out set did not stress the dimension. This is the exact failure mode we wrote about in the eval-suite dispatch — eval suites that are not seeded broadly enough fail to catch what production cares about.
The training input curation had no balance check. The 412 drafts were used as-is, with no analysis of whether the set was representative across the dimensions the model would be expected to perform well on. A simple stratification check would have flagged the regulatory-compliance batch as overrepresented relative to the broader brand-voice distribution.
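The check we had in mind is small. The sketch below is illustrative: it covers a single dimension, and the field names, reference distribution, and overrepresentation threshold are assumptions rather than our production values. The production check described under process changes spans editorial domain, document type, and length.

```python
# Sketch of a pre-fine-tune balance check. Field names, reference shares, and the
# 2x overrepresentation threshold are illustrative assumptions.
from collections import Counter

MAX_OVERREPRESENTATION = 2.0  # flag a domain at more than 2x its production share

def stratification_check(drafts: list[dict], reference_share: dict[str, float]) -> list[str]:
    """Return the domains whose share of the training set exceeds the allowed
    multiple of their share in the production query distribution."""
    counts = Counter(d["domain"] for d in drafts)
    total = len(drafts)
    flagged = []
    for domain, count in counts.items():
        observed = count / total
        expected = reference_share.get(domain, 0.0)
        if expected == 0.0 or observed > MAX_OVERREPRESENTATION * expected:
            flagged.append(domain)
    return flagged

# Toy numbers in the shape of the incident: six regulatory drafts in a set of 412,
# against a production distribution where regulatory work is a far smaller share.
drafts = [{"domain": "general"}] * 406 + [{"domain": "regulatory"}] * 6
reference = {"general": 0.992, "regulatory": 0.005, "executive": 0.003}
print(stratification_check(drafts, reference))  # -> ['regulatory']
```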
Process changes.
- Training input set must pass a stratification check before fine-tune cycles can begin. The check requires the input set to be approximately representative of the production query distribution across editorial-domain, document-type, and length dimensions. Outlier batches are flagged for review and either rebalanced (additional inputs added) or excluded.
- Eval rubric for brand voice extended to include domain-stratified scoring. The brand-fit score now reports per-domain (general, regulatory, executive, customer-facing) instead of a single aggregate. A regression in any domain blocks promotion regardless of the aggregate.
- Editorial team pager added for the first 60 minutes after every brand-voice model promotion. Two senior editors review a stratified sample of new outputs in the first hour and have direct rollback authority. The pager rotates weekly.
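For concreteness, a sketch of how the first-hour review sample in the last item might be drawn. The record shape, domain tags, and per-domain sample size are assumptions; the only requirement is that the reviewing editors see every dimension the model is serving, not just the most common one.

```python
# Sketch of drawing the first-hour review sample. Domain tags, sample sizes, and the
# output record shape are illustrative assumptions.
import random
from collections import defaultdict

def review_sample(outputs: list[dict], per_domain: int = 5, seed: int = 0) -> list[dict]:
    """Pick up to `per_domain` recent outputs from each domain for editor review."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for out in outputs:
        by_domain[out["domain"]].append(out)
    sample = []
    for items in by_domain.values():
        sample.extend(rng.sample(items, min(per_domain, len(items))))
    return sample

outputs = [{"id": i, "domain": d} for i, d in enumerate(
    ["general"] * 30 + ["regulatory"] * 3 + ["executive"] * 2)]
print(len(review_sample(outputs)))  # -> 10: 5 general, 3 regulatory, 2 executive
```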
What we are not changing.
We considered, and rejected, two changes that would have been overcorrections.
Excluding regulatory-compliance drafts from training entirely. The drafts were correct for their context, and the model needs to handle the context. The fix is balance, not exclusion.
Adding a manual approval gate on every fine-tune promotion. Fine-tune promotions happen weekly. A manual gate would slow the cycle meaningfully and shift incident risk from drift to delayed promotion. The stratification check, the per-domain rubric, and the post-promotion editor pager produce the same safety without the human-cycle cost.
What this incident demonstrated about the architecture.
Three things, two of which we knew and one of which we did not.
The editorial team is still the last line of defense. Automated eval is necessary but not sufficient. The editorial team's calibration is more sensitive than any rubric we have built, and the editor-in-the-loop default is what makes the team's catches load-bearing instead of advisory.
Eval suites need to be stratified along the dimensions the model is expected to perform on. A single aggregate score hides regressions in specific dimensions. The per-domain rubric is the minimum viable instrumentation for catching the regressions that actually matter.
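A sketch of what that gate looks like in practice, using the domain list from the rubric described above. The threshold value and the score plumbing are assumptions; the one deliberate choice worth copying is that an unmeasured domain blocks promotion rather than passing silently, which is exactly the gap the original held-out set had for regulatory drafts.

```python
# Sketch of the per-domain promotion gate. The domain list mirrors the rubric above;
# the threshold and the score plumbing are illustrative assumptions.
DOMAINS = ("general", "regulatory", "executive", "customer-facing")
REGRESSION_THRESHOLD = -0.05  # hypothetical: any domain dropping more than this blocks promotion

def promotion_allowed(per_domain_delta: dict[str, float]) -> bool:
    """Block on a regression in any single domain, regardless of the aggregate."""
    missing = [d for d in DOMAINS if d not in per_domain_delta]
    if missing:
        # An unmeasured domain is treated as a failed check, not a pass.
        return False
    return all(per_domain_delta[d] >= REGRESSION_THRESHOLD for d in DOMAINS)

# The original incident, restated in this shape: a fine aggregate with a missing domain.
print(promotion_allowed({"general": 0.01, "executive": -0.01, "customer-facing": 0.02}))  # False
print(promotion_allowed({"general": 0.0, "regulatory": -0.001,
                         "executive": 0.01, "customer-facing": 0.0}))                     # True
```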
Training input curation is a first-class engineering concern, not a data-collection afterthought. Letting the training set be "whatever was editor-accepted in the trailing window" worked for many cycles. It failed when the trailing window was unrepresentative. The stratification check is a small piece of code that prevents the entire class of incident.
We publish this postmortem because the failure is structural to fine-tune-cycle architectures that pull training data from editorial telemetry. Any deployment doing this is one unrepresentative batch away from the same incident. The fixes are bounded and worth implementing before the incident, not after.