
Postmortem: an SSO scope change that paged twelve workflows in seven minutes.

A routine SSO scope cleanup invalidated a credential that twelve workflows were silently sharing. The pages started seven minutes after the change. Here is the timeline, the credential audit we now run, and the runbook addition that prevents the recurrence.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED: JANUARY 15, 2026
READ TIME: 9 MIN
CATEGORY: POSTMORTEMS

On December 9, 2025, an SSO scope cleanup removed a permission that twelve of our workflows were silently depending on. The pages started seven minutes after the change went live. Total elapsed time from incident start to full recovery was forty-seven minutes. The customer-facing impact was bounded; the workflows failed cleanly rather than producing bad outputs. But the incident exposed a credential-audit gap that we have since closed. We are publishing the postmortem because the SSO-scope-cleanup pattern is one of the most common categories of self-inflicted incident in enterprise AI deployments, and the gap we had shows up in most deployments we have audited.

Timeline.

  • 14:00 UTC — Scheduled SSO scope cleanup begins. The cleanup removes, from the AI-deployment service principal, any scope the principal has not invoked in the trailing 30 days. Audit logs confirm the targeted scopes meet that criterion, and the cleanup is approved as low-risk.
  • 14:02 UTC — Scope removal executes. The principal loses the scope.
  • 14:09 UTC — First workflow run that requires the removed scope executes. The downstream system returns a 403. The workflow's runtime catches the failure and pages the on-call engineer.
  • 14:11 UTC — Three more workflows fail. Pager noise indicates a cluster.
  • 14:13 UTC — On-call engineer correlates the failures to the SSO scope cleanup. The engineer pulls the cleanup log and confirms the removed scope is the one the workflows were using.
  • 14:17 UTC — Decision: re-grant the scope and investigate why the audit-based cleanup logic missed the dependency. The re-grant is approved by the security-on-call within four minutes.
  • 14:21 UTC — Scope re-granted. Failed workflow runs begin succeeding on retry within 90 seconds.
  • 14:47 UTC — All twelve affected workflows confirmed healthy. Backlog cleared. End of incident.
  • Total elapsed: 47 minutes. Affected runs: 31. Affected outputs the editor team had to clear from the queue manually: 6.

Root cause.

The SSO cleanup logic decided which scopes to remove based on whether the service principal had invoked them in the trailing thirty days. The logic was correct as written; the principal had not invoked the removed scopes in that window. The reason it had not was that the affected workflows ran on a quarterly schedule, with the next scheduled run two days after the cleanup. The cleanup removed a scope that was due to be used two days later.

The cleanup logic had no awareness of scheduled future use. It treated thirty-day non-use as a sufficient proxy for "unused." For the twelve quarterly workflows, the proxy was wrong.
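For concreteness, here is a minimal sketch of the decision rule as it stood. The names are illustrative, not our production cleanup code, but the proxy it encodes is the one that failed.

```python
from datetime import datetime, timedelta, timezone

# Illustrative names only; the real cleanup runs against the SSO provider's audit API.
TRAILING_WINDOW = timedelta(days=30)

def scopes_to_remove(scope_last_invoked: dict[str, datetime | None]) -> set[str]:
    """Flag a scope for removal if the principal has not invoked it inside the window.

    This is the proxy that failed: "not invoked in 30 days" stood in for
    "not needed", which is false for workflows on a quarterly schedule.
    """
    now = datetime.now(timezone.utc)
    stale = set()
    for scope, last_invoked in scope_last_invoked.items():
        if last_invoked is None or now - last_invoked > TRAILING_WINDOW:
            stale.add(scope)  # no check of declared dependencies or scheduled future runs
    return stale
```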

What worked.

Workflows failed cleanly. The runtime caught the 403 from each failed call, marked the run as failed-with-retry, and paged on-call. No partial state was committed. No outputs shipped on stale or unauthorized data.
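To make that failure path concrete, here is a minimal sketch under stated assumptions: `run_step` and the three helper hooks are hypothetical stand-ins, not our runtime. The property that mattered is the ordering: record the failure, page, and only commit output after a successful call.

```python
import requests

# Hypothetical stand-ins for the runtime's real hooks.
def mark_failed_with_retry(run_id: str, reason: str) -> None:
    print(f"[run {run_id}] marked failed-with-retry: {reason}")

def page_oncall(message: str) -> None:
    print(f"[page] {message}")

def commit_output(run_id: str, payload: dict) -> None:
    print(f"[run {run_id}] output committed")

def run_step(run_id: str, url: str, token: str) -> None:
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
    if resp.status_code == 403:
        # Nothing has been committed yet, so the run fails cleanly:
        # record the failure, leave the run retryable, and page on-call.
        mark_failed_with_retry(run_id, reason="403 from downstream system")
        page_oncall(f"run {run_id} failed: downstream returned 403")
        return
    resp.raise_for_status()
    commit_output(run_id, resp.json())  # commit only after the call succeeds
```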

Diagnosis was fast. The on-call engineer correlated the failure cluster to the SSO cleanup in four minutes because the cleanup had been logged in the operations channel and the failure traces named the scope.

Re-grant was fast. The security-on-call approved the re-grant within four minutes of the request. The scope was back within sixteen minutes of the cleanup that had removed it.

What did not work.

The cleanup logic had no awareness of scheduled workflow runs. The audit was based on past use, not on declared dependency. The twelve quarterly workflows had explicit declarations of which scopes they required, but the cleanup logic was not consulting the declarations.

The change was approved as low-risk on the basis of incomplete information. The reviewer who approved the cleanup saw only the audit data and the trailing-30-day non-use. The reviewer had no visibility into the workflow-dependency declarations because they lived in a different system from the SSO audit dashboard.

Process changes.

  1. SSO scope-cleanup logic now reads the workflow-dependency declarations as part of the cleanup decision. A scope that appears in any workflow's declared dependency set is not removed, regardless of trailing-window use (sketched just after this list). The change reduced scope-cleanup volume by about 20% and eliminates the entire incident class.
  2. Scope-cleanup approval workflow now requires the reviewer to confirm that no declared workflow dependency would be invalidated by the removal. The workflow-dependency check is automated and surfaces in the same UI as the audit data, so the reviewer has the full picture.
  3. Workflow-dependency declarations are now validated in CI. Any workflow that uses a scope not in its declared dependency set fails CI (the second sketch below shows the shape of the check). This closes the gap where a workflow's declared dependencies could drift from its actual dependencies.
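A minimal sketch of the first change, assuming a hypothetical declaration shape (`WORKFLOW_DECLARATIONS` is illustrative); the essential move is that any declared scope is subtracted from the removal set before anything else happens.

```python
# Hypothetical declaration shape; the real declarations live alongside each workflow.
WORKFLOW_DECLARATIONS: dict[str, set[str]] = {
    "quarterly-board-pack": {"reports.read", "finance.read"},
    "weekly-digest": {"reports.read"},
}

def declared_scopes(declarations: dict[str, set[str]]) -> set[str]:
    """Union of every workflow's declared scope dependencies."""
    return set().union(*declarations.values()) if declarations else set()

def safe_to_remove(stale_scopes: set[str], declarations: dict[str, set[str]]) -> set[str]:
    """A scope declared by any workflow is kept, regardless of trailing-window use."""
    return stale_scopes - declared_scopes(declarations)

# e.g. safe_to_remove({"finance.read", "legacy.write"}, WORKFLOW_DECLARATIONS)
# returns {"legacy.write"}: the quarterly workflow's scope survives the cleanup.
```

And a sketch of the CI check from the third change. `scopes_used_by` stands in for whatever manifest or trace analysis produces each workflow's actual scope usage; the names are again illustrative.

```python
import sys

def validate_declarations(declared: dict[str, set[str]],
                          scopes_used_by: dict[str, set[str]]) -> list[str]:
    """Return one error per workflow that uses a scope it has not declared."""
    errors = []
    for workflow, used in scopes_used_by.items():
        undeclared = used - declared.get(workflow, set())
        if undeclared:
            errors.append(f"{workflow}: uses undeclared scopes {sorted(undeclared)}")
    return errors

if __name__ == "__main__":
    # Hard-coded here for shape; in CI both mappings are loaded from the repo.
    problems = validate_declarations(
        declared={"quarterly-board-pack": {"reports.read"}},
        scopes_used_by={"quarterly-board-pack": {"reports.read", "finance.read"}},
    )
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # failing the build is the point: drift is caught before deploy
```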
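The two checks are deliberately redundant: the CI gate keeps the declarations honest, and the cleanup guard trusts only what is declared, so a scope has to slip past both before the December 9 failure mode can recur.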

What this incident demonstrated about the architecture.

Three things.

Failure-state cleanliness paid for itself. The workflows failed without producing partial outputs because the runtime owned the failure semantics. We have written about why the workflow runtime needs to own failure-state explicitly; this incident is a textbook example of the architectural property earning its keep.

Cross-system audits depend on cross-system visibility. The SSO cleanup logic and the workflow-dependency declarations were in different systems. The cleanup logic was correct in its system; the gap was the absence of a check across systems. Most enterprise AI incidents we have investigated have this shape — the failure is at the intersection of two systems each of which is locally correct.

Quarterly workflows are an underweighted source of incident risk. The trailing-30-day audit failed because the workflow ran quarterly. Quarterly workflows are common in enterprise AI deployments — board-cycle workflows, audit-cycle workflows, planning-cycle workflows — and they are the workflows most likely to be invisible to the typical operational instrumentation. The dependency-audit discipline has to extend to them explicitly.

We publish this postmortem because the cross-system-audit gap is the most common category of self-inflicted enterprise AI incident in our experience. The fix is not heroic. It is a small additional check in the cleanup logic that consults the dependency declarations. The reason most deployments do not have this check is not that it is hard. It is that the gap is not visible until it produces an incident.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
