POSTMORTEMS

Postmortem: a queue-saturation incident that wasn't the queue.

Workflow throughput dropped by 70% over two hours. Every metric pointed at the queue. The queue was healthy. Here is what was actually happening, and the diagnostic discipline that surfaces it next time.

By D. Cho, Staff Engineer · KNYTE
Published April 04, 2026
Read time: 11 min
Category: Postmortems

On March 29, 2026, workflow throughput dropped by roughly seventy percent over two hours, and the backlog that built up during the drop took another fourteen hours to clear after we believed the incident had been resolved. Every dashboard pointed at the queue: queue depth was rising, processing rate was falling, and the obvious diagnosis was queue saturation. The on-call engineer scaled up queue workers. The queue absorbed the additional capacity and continued to fall behind, because the queue was not actually the bottleneck. We diagnosed the real cause an hour and twenty-five minutes into the incident, after a fortunate observation by an editor who noticed something the dashboard did not surface. We are publishing the postmortem because the diagnostic pattern — "the obvious bottleneck is not the bottleneck" — is one we have now seen often enough to write a discipline around it.

Timeline.

  • 13:08 UTC — Workflow throughput dashboard shows a 30% drop over the trailing 15 minutes. Queue depth begins rising.
  • 13:14 UTC — Throughput drop reaches 50%. Queue depth doubles. On-call engineer paged.
  • 13:18 UTC — Engineer scales queue workers from 8 to 16. Queue depth continues to grow.
  • 13:24 UTC — Throughput drop reaches 70%. Engineer scales queue workers to 32. No improvement.
  • 13:38 UTC — Engineer escalates. Second engineer joins. Hypothesis review: queue depth is rising but worker scale is not improving throughput. Either workers are blocked downstream or the queue is not actually the bottleneck.
  • 14:14 UTC — Editor reviewing a flagged draft notices that retrieval results are returning 50% fewer chunks than usual. Reports to operations channel.
  • 14:18 UTC — Engineering team correlates. The retrieval layer is producing reduced result sets. The model is generating outputs against thinner contexts, which produces more editor-flagged drafts; the flagged drafts are queuing for editor review faster than the editor team can clear them, which is why the queue is backing up.
  • 14:33 UTC — Root cause identified at the retrieval layer: an embedding cache had been silently degrading after a network partition between the retrieval service and the cache backend. The cache was returning stale entries, including some that had been deleted from the corpus.
  • 14:48 UTC — Cache flushed and resynced against the corpus's current state. Retrieval result sizes return to normal. Throughput begins recovering.
  • 15:19 UTC — Throughput restored to baseline. End of acute incident.
  • 15:19 UTC – 05:14 UTC (next day) — Editor queue backlog from the depressed-throughput period works through. Operationally healthy by next morning.
  • Total acute window: 2h 11m. Total recovery: 16h 6m.

Root cause.

An embedding cache backend had a network partition with the retrieval service two hours before the incident became visible. The partition healed, but the cache's view of the corpus was by then stale; entries that had been deleted or updated while the partition was open were never invalidated. Retrieval queries against the stale cache returned reduced result sets — fewer chunks, mixed with some chunks that no longer existed in the canonical corpus.
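The failure shape is easier to see in miniature than in the real backend. The sketch below is not our cache implementation; it is a deliberately tiny, hypothetical event-invalidated cache, and it shows how invalidation events dropped during a partition leave deleted chunks serveable after the partition heals.

    from dataclasses import dataclass, field

    @dataclass
    class EmbeddingCache:
        # Hypothetical event-invalidated cache; every name here is illustrative.
        entries: dict = field(default_factory=dict)   # chunk_id -> embedding
        partitioned: bool = False                     # True while the partition is open

        def put(self, chunk_id, embedding):
            self.entries[chunk_id] = embedding

        def on_corpus_event(self, event):
            # Invalidation depends on receiving every corpus event. While the
            # partition is open, events are silently dropped; that is the bug's shape.
            if self.partitioned:
                return
            if event["type"] in ("delete", "update"):
                self.entries.pop(event["chunk_id"], None)

        def get(self, chunk_id):
            # No freshness check on the read path: a stale entry is
            # indistinguishable from a live one.
            return self.entries.get(chunk_id)

    cache = EmbeddingCache()
    cache.put("chunk-42", [0.1, 0.2, 0.3])

    cache.partitioned = True                                            # partition opens
    cache.on_corpus_event({"type": "delete", "chunk_id": "chunk-42"})   # event lost
    cache.partitioned = False                                           # partition heals

    assert cache.get("chunk-42") is not None   # the deleted chunk is still served

The read path is the point: nothing checks freshness, so nothing downstream can tell a stale entry from a live one until output quality degrades.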

The model, given thinner retrieval context, produced lower-quality outputs. The editor team caught the lower quality and flagged drafts at a higher rate than usual. The flagged drafts queued for editor review faster than the editor team could process them. The queue grew. The queue depth was a symptom; the cache was the cause.

What worked.

Editor observation surfaced the real cause. The editor's note about reduced retrieval result sizes was the diagnostic that broke the case. The editor was not part of the on-call rotation. The operations channel where the note landed was the surface that connected the editor's observation to the engineering investigation. We have written about the editor-in-the-loop pattern as a quality control. This incident illustrates a second function: editors are also a diagnostic surface.

Trace data confirmed quickly. Once the hypothesis shifted from the queue to the retrieval layer, the trace data on retrieval calls confirmed it within minutes. Result-set sizes were measurably reduced; the affected calls were identifiable; the cache backend's freshness signal could be validated independently.
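For readers who want the shape of that check: a simplified sketch, assuming each retrieval span carries a result count and that spans can be pulled for an arbitrary window. The loader named in the usage comment is hypothetical plumbing, not a real trace-store API.

    from statistics import median

    def retrieval_regressed(recent_spans, baseline_spans, threshold=0.6):
        # Each span is a dict carrying a "result_count" field; in our traces this
        # attribute lives on the retrieval span. Returns (regressed, recent, baseline).
        recent = median(s["result_count"] for s in recent_spans)
        baseline = median(s["result_count"] for s in baseline_spans)
        return recent < threshold * baseline, recent, baseline

    # Usage shape; load_retrieval_spans is a stand-in for your trace store:
    # regressed, recent, baseline = retrieval_regressed(
    #     load_retrieval_spans(minutes=15),
    #     load_retrieval_spans(days=7),
    # )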

What did not work.

The dashboard pointed at the wrong layer. Queue depth was the most visible signal because it is one of the metrics the dashboard surfaces prominently; retrieval result-set size was not surfaced prominently. The on-call engineer reasonably acted on the most visible signal, which was the wrong one. The dashboard's layout was implicitly biasing the diagnosis toward whatever it happened to feature.

The cache backend's freshness was not monitored. The cache had a freshness signal — the timestamp of its last successful sync with the corpus — but the signal was not on the operations dashboard. It would have caught the issue within minutes of the partition. The decision not to surface it was a small, defensible call at the time it was made; that call compounded into a multi-hour incident.
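The check itself is nearly trivial, which is what makes the omission sting. A minimal sketch, assuming the backend exposes its last successful corpus sync as a timestamp; the attribute name is illustrative.

    import time

    MAX_STALENESS_SECONDS = 60  # the threshold we later adopted for paging

    def cache_staleness_seconds(cache_backend):
        # last_successful_sync_ts stands in for whatever your backend exposes as
        # its last completed corpus sync; the signal existed, it just was not read.
        return time.time() - cache_backend.last_successful_sync_ts

    def is_stale(cache_backend):
        return cache_staleness_seconds(cache_backend) > MAX_STALENESS_SECONDS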

Worker-scaling masked the symptom. Scaling queue workers added capacity that the queue could absorb, which made the queue depth appear to stabilize for short windows. The stabilization was misleading; the underlying throughput problem was unchanged. The engineer's instinct to scale the obvious bottleneck made the diagnosis harder rather than easier.

Architectural changes.

  1. Cache freshness as a first-class operational metric. Every cache layer in the retrieval pipeline now reports its sync-staleness as a top-line dashboard metric. Cache layers more than 60 seconds out of sync with the corpus trigger a paging alert.
  2. Retrieval result-set size as a workflow metric. The dashboard now plots median retrieval result-set size per workflow over time. Drops below the trailing-week median trigger an investigation page even before any downstream impact has been observed (a sketch of how these first two signals are exported follows this list).
  3. Editor observations have a structured channel. The operations channel is no longer the primary path for editor-team observations of system anomalies. We have added a structured observation form that the editor team can submit, which routes to the on-call queue with the editor's specific context. The latency between editor observation and engineering visibility is now measured in seconds, not in however long it takes someone to read a Slack channel.
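
For concreteness, here is a minimal sketch of how the first two changes can surface in a dashboard, using prometheus_client as the example exporter. The metric names, the collect() function, and the last_successful_sync_ts attribute are illustrative stand-ins rather than our production code, and the alert thresholds themselves live in alerting config rather than here.

    import time
    from statistics import median
    from prometheus_client import Gauge, start_http_server

    # Change 1: cache sync-staleness as a top-line metric (page above 60 seconds).
    CACHE_STALENESS = Gauge(
        "retrieval_cache_staleness_seconds",
        "Seconds since the cache layer last synced with the corpus",
        ["cache_layer"],
    )

    # Change 2: median retrieval result-set size per workflow (page on drops
    # below the trailing-week median).
    RESULT_SET_SIZE = Gauge(
        "retrieval_result_set_size_median",
        "Median retrieval result-set size over the last collection window",
        ["workflow"],
    )

    def collect(cache_layers, result_counts_by_workflow):
        # One collection pass. Both arguments are stand-ins for real plumbing:
        # cache_layers maps layer name -> backend handle; result_counts_by_workflow
        # maps workflow name -> recent per-call result-set sizes.
        for name, backend in cache_layers.items():
            CACHE_STALENESS.labels(cache_layer=name).set(
                time.time() - backend.last_successful_sync_ts
            )
        for workflow, counts in result_counts_by_workflow.items():
            if counts:
                RESULT_SET_SIZE.labels(workflow=workflow).set(median(counts))

    if __name__ == "__main__":
        start_http_server(9108)   # arbitrary scrape port
        # while True: collect(...); time.sleep(15)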

What this incident demonstrated.

The diagnostic discipline that prevents this incident class is to ask, before scaling the obvious bottleneck, whether the bottleneck is the cause or a symptom. The pattern is sometimes called "the obvious bottleneck is not the bottleneck," and it is more general than this incident. The applied form is: when scaling the obvious bottleneck does not improve throughput, the real constraint lives somewhere other than the layer the visible signal points at; in our case it sat upstream, at the retrieval layer.

We have also learned to weight editor observations more heavily in incident diagnosis. The editor team is closer to the output quality than the engineering team and notices specific quality regressions earlier than the dashboards do. Building structured channels for editor observations was a small operational investment; the informal version paid off during this incident, and the structured form has already paid off on two incidents since.

If your operations dashboard has queue depth as a prominent metric and cache freshness as a buried one, the next incident in this class is queued up. The architectural change is small — surface the freshness signal, plot the result-set size, structure the editor observation channel. The next incident in this shape resolves in twenty minutes instead of two hours.

D. Cho, Staff Engineer · KNYTE

Built distributed retrieval at Pinterest and Databricks. Spends most days inside trace viewers and the rest writing about why your eval suite lies to you.
