
Postmortem: a model upgrade that flipped the citation style.

A scheduled model promotion changed the citation format the model used by default. The change passed the eval suite. The downstream parsers did not. Here is what we learned about the gap between syntactic and semantic eval.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED: APRIL 28, 2026
READ TIME: 10 MIN
CATEGORY: POSTMORTEMS

On April 12, 2026, a scheduled model promotion altered the default citation format the model produced. The change passed our eval suite — every editorial rubric scored within noise of the previous model. It did not pass our downstream parsers, which had been built against the previous citation format and silently failed on the new one. The failure was silent because the parsers continued to return citation objects; the objects were structurally valid and pointed at the wrong corpus documents. The silence is what made the incident long. We caught it three hours and twenty-four minutes after the new model began serving production, when an editor noticed that a generated draft cited a document the editor knew should not have been in scope.

Timeline.

  • 08:00 UTC — Scheduled model promotion (v3.4 → v3.5). Eval suite passes. Editorial rubrics within noise. Promotion approved.
  • 08:14 UTC — New model serves production. Citation format changes from `[doc-id:chunk-id]` to `[doc-id]:[chunk-id]` — same data, different syntax.
  • 08:14 UTC – 11:38 UTC — Downstream citation parsers, built against the prior format, parse the new format incorrectly. The parsers extract a doc-id that is the literal string `[doc-id]` and a chunk-id that includes the closing bracket. Both are syntactically valid identifiers in the corpus's namespace. The parsers return citation objects that point at non-existent documents but pass validation. (A minimal sketch of the misparse follows the timeline.)
  • 08:14 UTC – 11:38 UTC — 247 production drafts generated with mis-parsed citations. The drafts themselves contain correct content; the citation links underneath them point at the wrong documents.
  • 11:38 UTC — Editor reviewing a draft notices a cited document that should not have been in scope. Investigates by clicking the citation link. Lands on a different document than the citation text suggests. Reports to operations channel.
  • 11:42 UTC — On-call engineer reproduces. Identifies parser-format mismatch.
  • 11:51 UTC — Decision: roll back to v3.4 while we patch the parsers and decide whether to re-promote v3.5 with the parsers updated, or pin v3.4 with the model team adjusting v3.5's citation defaults.
  • 11:53 UTC — v3.4 in production. Citation format reverts. Parsers operate correctly.
  • 11:58 UTC — Incident contained. Backfill effort begins for the 247 affected drafts: re-generate against v3.4 to restore correct citations, audit the affected drafts, notify the affected workflows.
  • Total elapsed window of incorrect citations: 3 hours 24 minutes. Affected drafts: 247. Drafts that shipped to downstream systems with incorrect citations: 12. Drafts that the editor team was able to catch and correct before downstream emit: 235.
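
To make the 08:14 misparse concrete, here is a minimal sketch of the failure class in Python. The production parser is richer than this, and the identifiers (`kb-4821`, `07`) are invented for illustration, so the exact garbage strings differ from the ones above, but the shape of the bug is the same: a token regex and a split written for `[doc-id:chunk-id]` still match the new syntax, so nothing raises; the parser simply returns identifiers carrying stray brackets.

```python
import re

# Pre-promotion parser, reconstructed as a sketch (not the production code).
# It assumes every citation looks like [doc-id:chunk-id].
CITATION_TOKEN = re.compile(r"\[\S+\]")      # grab a whole bracketed run

def parse_citation(token: str) -> tuple[str, str]:
    inner = token[1:-1]                      # drop the outer brackets
    doc_id, chunk_id = inner.split(":", 1)   # split doc-id from chunk-id
    return doc_id, chunk_id

# Old format: behaves as intended.
old = CITATION_TOKEN.search("cited in [kb-4821:07] above").group(0)
parse_citation(old)    # -> ("kb-4821", "07")

# New format: the token regex still matches, because \S+ happily spans the
# middle "]:[", so nothing raises -- the parser just returns identifiers
# with stray brackets that name no real document.
new = CITATION_TOKEN.search("cited in [kb-4821]:[07] above").group(0)
parse_citation(new)    # -> ("kb-4821]", "[07")
```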

Root cause.

Two layers, both incorrect in different ways. The model's new training data included examples with the new citation format, and the model learned to prefer it. The eval suite scored generation quality, not citation parsability — the generated text was high-quality with either format, so the eval rubrics did not catch the change. The downstream parsers were built against the prior format and assumed it as a contract. The contract was implicit and was not part of any test the model promotion ran against.

The combination is what produced the silent failure. The model's behavior changed in a way the eval suite did not measure, the unchanged parsers turned the new format into syntactically valid but semantically wrong outputs, and there was no integration check that would have caught the mismatch.

What worked.

Editor caught it. The 247 affected drafts went through editorial review per the editor-in-the-loop default. The editor caught the citation issue on a draft within the first batch they reviewed under the new model. This is the value of editor-in-the-loop on outputs the eval suite cannot fully verify.

Trace data made the diagnosis fast. The on-call engineer pulled the trace for the affected draft, saw the citation in the model's raw output, compared to the parsed citation, and identified the format mismatch in under fifteen minutes. Without the trace layer, the diagnosis would have required reproducing the model's behavior manually.

Rollback was clean. The previous model was warm. The routing flip executed in 90 seconds. The 235 drafts not yet downstream-emitted were re-generated against v3.4 within the next two hours. The 12 that had already been emitted downstream were investigated individually; in 11 cases the downstream workflow caught the citation issue and re-issued the draft; in 1 case it did not, and a customer-facing correction was required.

What did not work.

The eval suite did not test citation parseability. The suite measured generation quality, editorial rubrics, and aggregate quality scores. It did not measure whether the output's citations could be correctly parsed by the downstream system. The integration contract between model output and downstream parser was implicit and untested.

The parsers failed silently. The parsers received an unexpected input format and did not detect that anything was wrong. They returned syntactically valid citation objects pointing at non-existent documents, and downstream code did not validate the existence of the cited documents. The combination meant the failure had no visible signal until an editor noticed.
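
A hypothetical reconstruction of why nothing raised, with invented names (`Citation`, `ID_SHAPE`) standing in for the internal types: the only validation the citation objects carried was a shape check on the identifier strings, so the mis-parsed values sailed through, and nothing ever asked the corpus whether the cited document existed.

```python
import re
from dataclasses import dataclass

# Hypothetical reconstruction -- the class and identifier grammar are invented
# for illustration. The point is that the validation was purely syntactic.
ID_SHAPE = re.compile(r"^[\w\[\]-]+$")   # permissive: word chars, "-", "[", "]"

@dataclass
class Citation:
    doc_id: str
    chunk_id: str

    def is_valid(self) -> bool:
        # Shape check only: nothing here asks the corpus whether doc_id exists.
        return bool(ID_SHAPE.match(self.doc_id) and ID_SHAPE.match(self.chunk_id))

# Mis-parsed identifiers from the incident window pass this check even though
# no document named "kb-4821]" exists anywhere in the corpus.
Citation(doc_id="kb-4821]", chunk_id="[07").is_valid()   # True
```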

Architectural changes.

  1. Eval suite extended with citation-parseability tests. Every test case now includes a check that the model's citations can be parsed and resolved to existing corpus documents. A model that produces citations the parser cannot resolve fails the eval, regardless of generation quality.
  2. Parsers now validate citation resolution. Parsers that produce citation objects must verify the cited documents exist in the corpus before returning. Citations that resolve to non-existent documents trigger an error, not a silent return.
  3. Integration contracts between model and downstream consumers are now explicit. Each downstream consumer of model output declares the format contract it requires. The eval suite runs the contract checks alongside the editorial rubrics. Format changes that break a contract block promotion.
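
Taken together, the three changes reduce to a small promotion-time gate. A sketch of its shape, under assumed names (`Contract`, `citations_resolve`, `run_promotion_gate`, and the injected `parse_citations` are illustrative, not the internal API): each consumer registers a callable contract, the citation contract runs the consumer's own parser over the candidate model's drafts and resolves every extracted doc-id against the corpus, and any broken contract blocks the promotion regardless of rubric scores.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Contract:
    consumer: str                        # the downstream system that declared it
    check: Callable[[str], bool]         # True when a draft satisfies the contract

def citations_resolve(draft_text: str,
                      parse_citations: Callable[[str], list[tuple[str, str]]],
                      corpus_doc_ids: set[str]) -> bool:
    """Every citation the downstream parser extracts must name a real document."""
    citations = parse_citations(draft_text)
    return bool(citations) and all(doc_id in corpus_doc_ids
                                   for doc_id, _chunk_id in citations)

def run_promotion_gate(drafts: list[str], contracts: list[Contract]) -> list[str]:
    """Return the consumers whose contracts the candidate model breaks.
    A non-empty list blocks the promotion, whatever the editorial rubrics say."""
    return [c.consumer for c in contracts
            if not all(c.check(draft) for draft in drafts)]
```

In this shape, the April change would have tripped the gate at 08:00, before the new model served a single production draft, instead of an editor noticing at 11:38.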

What this incident demonstrated.

The eval suite that we wrote about in the eval-suite dispatch is necessary but not sufficient. Production eval has to include integration tests for every downstream contract the model output is implicitly serving. Generation quality and downstream consumability are separate dimensions, and a model can regress on one without regressing on the other.

The deeper lesson is that implicit contracts accumulate silently. The citation format was a contract that nobody had written down. The downstream parsers depended on it. The model team did not know about it. The promotion changed the contract without anyone noticing because the contract was not visible. Making implicit contracts explicit is a discipline that prevents this entire class of incident, and the discipline is bounded once the architecture supports declared contracts as first-class objects.

If your AI deployment has model outputs that downstream systems consume, every implicit consumer contract is a candidate for the next silent regression. The audit is bounded; the alternative is a postmortem shaped much like this one.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
