
Postmortem: an SSO scope change that paged twelve workflows in seven minutes.

A routine SSO scope cleanup invalidated a credential that twelve workflows were silently sharing. The pages started seven minutes after the change. Here is the timeline, the credential audit we now run, and the runbook addition that prevents the recurrence.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED: JANUARY 15, 2026
READ TIME: 9 MIN
CATEGORY: POSTMORTEMS

On December 9, 2025, an SSO scope cleanup removed a permission that twelve of our workflows were silently depending on. The pages started seven minutes after the change went live. Total elapsed time from incident start to full recovery was forty-seven minutes. The customer-facing impact was bounded; the workflows failed cleanly rather than producing bad outputs. But the incident exposed a credential-audit gap that we have since closed. We are publishing the postmortem because the SSO-scope-cleanup pattern is one of the most common categories of self-inflicted incident in enterprise AI deployments, and the gap we had shows up in most deployments we have audited.

Timeline.

  • 14:00 UTC — Scheduled SSO scope cleanup begins. The cleanup removes, from the AI-deployment service principal, any scope the principal has not invoked in the trailing 30 days. Audit logs confirm the targeted scopes meet that criterion, and the cleanup is approved as low-risk.
  • 14:02 UTC — Scope removal executes. The principal loses the scope.
  • 14:09 UTC — First workflow run that requires the removed scope executes. The downstream system returns a 403. The workflow's runtime catches the failure and pages the on-call engineer.
  • 14:11 UTC — Three more workflows fail. Pager noise indicates a cluster.
  • 14:13 UTC — On-call engineer correlates the failures to the SSO scope cleanup. The engineer pulls the cleanup log and confirms the removed scope is the one the workflows were using.
  • 14:17 UTC — Decision: re-grant the scope and investigate why the audit-based cleanup logic missed the dependency. The re-grant is approved by the security-on-call within four minutes.
  • 14:21 UTC — Scope re-granted. Failed workflow runs begin succeeding on retry within 90 seconds.
  • 14:47 UTC — All twelve affected workflows confirmed healthy. Backlog cleared. End of incident.
  • Total elapsed: 47 minutes. Affected runs: 31. Affected outputs the editor team had to clear from the queue manually: 6.

Root cause.

The SSO cleanup logic decided which scopes to remove based on whether the service principal had invoked them in the trailing thirty days. The logic was correct as written; the principal had not invoked the removed scopes in that window. The reason it had not was that the affected workflows ran on a quarterly schedule, with the next scheduled run two days after the cleanup. The cleanup removed a scope that was due to be used two days later.

The cleanup logic had no awareness of scheduled future use. It treated thirty-day non-use as a sufficient proxy for "unused." For the twelve quarterly workflows, the proxy was wrong.
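For concreteness, here is a minimal sketch of the decision rule as it stood. The names are illustrative, not our production cleanup code, but the proxy it encodes is the one that failed.

```python
from datetime import datetime, timedelta, timezone

# Illustrative names only; the real cleanup runs against the SSO provider's audit API.
TRAILING_WINDOW = timedelta(days=30)

def scopes_to_remove(scope_last_invoked: dict[str, datetime | None]) -> set[str]:
    """Flag a scope for removal if the principal has not invoked it inside the window.

    This is the proxy that failed: "not invoked in 30 days" stood in for
    "not needed", which is false for workflows on a quarterly schedule.
    """
    now = datetime.now(timezone.utc)
    stale = set()
    for scope, last_invoked in scope_last_invoked.items():
        if last_invoked is None or now - last_invoked > TRAILING_WINDOW:
            stale.add(scope)  # no check of declared dependencies or scheduled future runs
    return stale
```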

What worked.

Workflows failed cleanly. The runtime caught the 403 from each failed call, marked the run as failed-with-retry, and paged on-call. No partial state was committed. No outputs shipped on stale or unauthorized data.
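To make that failure path concrete, here is a minimal sketch under stated assumptions: `run_step` and the three helper hooks are hypothetical stand-ins, not our runtime. The property that mattered is the ordering: record the failure, page, and only commit output after a successful call.

```python
import requests

# Hypothetical stand-ins for the runtime's real hooks.
def mark_failed_with_retry(run_id: str, reason: str) -> None:
    print(f"[run {run_id}] marked failed-with-retry: {reason}")

def page_oncall(message: str) -> None:
    print(f"[page] {message}")

def commit_output(run_id: str, payload: dict) -> None:
    print(f"[run {run_id}] output committed")

def run_step(run_id: str, url: str, token: str) -> None:
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
    if resp.status_code == 403:
        # Nothing has been committed yet, so the run fails cleanly:
        # record the failure, leave the run retryable, and page on-call.
        mark_failed_with_retry(run_id, reason="403 from downstream system")
        page_oncall(f"run {run_id} failed: downstream returned 403")
        return
    resp.raise_for_status()
    commit_output(run_id, resp.json())  # commit only after the call succeeds
```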

Diagnosis was fast. The on-call engineer correlated the failure cluster to the SSO cleanup in four minutes because the cleanup had been logged in the operations channel and the failure traces named the scope.

Re-grant was fast. The security-on-call approved the re-grant within four minutes of the request. The scope was back within sixteen minutes of the cleanup that had removed it.

What did not work.

The cleanup logic had no awareness of scheduled workflow runs. The audit was based on past use, not on declared dependency. The twelve quarterly workflows had explicit declarations of which scopes they required, but the cleanup logic was not consulting the declarations.

The change was approved as low-risk on the basis of incomplete information. The reviewer who approved the cleanup saw only the audit data and the trailing-30-day non-use. The reviewer had no visibility into the workflow-dependency declarations because they lived in a different system from the SSO audit dashboard.

Process changes.

  1. SSO scope-cleanup logic now reads the workflow-dependency declarations as part of the cleanup decision. A scope that appears in any workflow's declared dependency set is not removed, regardless of trailing-window use (sketched just after this list). The change reduced scope-cleanup volume by about 20% and eliminates the entire incident class.
  2. Scope-cleanup approval workflow now requires the reviewer to confirm that no declared workflow dependency would be invalidated by the removal. The workflow-dependency check is automated and surfaces in the same UI as the audit data, so the reviewer has the full picture.
  3. Workflow-dependency declarations are now validated in CI. Any workflow that uses a scope not in its declared dependency set fails CI (the second sketch below shows the shape of the check). This closes the gap where a workflow's declared dependencies could drift from its actual dependencies.
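A minimal sketch of the first change, assuming a hypothetical declaration shape (`WORKFLOW_DECLARATIONS` is illustrative); the essential move is that any declared scope is subtracted from the removal set before anything else happens.

```python
# Hypothetical declaration shape; the real declarations live alongside each workflow.
WORKFLOW_DECLARATIONS: dict[str, set[str]] = {
    "quarterly-board-pack": {"reports.read", "finance.read"},
    "weekly-digest": {"reports.read"},
}

def declared_scopes(declarations: dict[str, set[str]]) -> set[str]:
    """Union of every workflow's declared scope dependencies."""
    return set().union(*declarations.values()) if declarations else set()

def safe_to_remove(stale_scopes: set[str], declarations: dict[str, set[str]]) -> set[str]:
    """A scope declared by any workflow is kept, regardless of trailing-window use."""
    return stale_scopes - declared_scopes(declarations)

# e.g. safe_to_remove({"finance.read", "legacy.write"}, WORKFLOW_DECLARATIONS)
# returns {"legacy.write"}: the quarterly workflow's scope survives the cleanup.
```

And a sketch of the CI check from the third change. `scopes_used_by` stands in for whatever manifest or trace analysis produces each workflow's actual scope usage; the names are again illustrative.

```python
import sys

def validate_declarations(declared: dict[str, set[str]],
                          scopes_used_by: dict[str, set[str]]) -> list[str]:
    """Return one error per workflow that uses a scope it has not declared."""
    errors = []
    for workflow, used in scopes_used_by.items():
        undeclared = used - declared.get(workflow, set())
        if undeclared:
            errors.append(f"{workflow}: uses undeclared scopes {sorted(undeclared)}")
    return errors

if __name__ == "__main__":
    # Hard-coded here for shape; in CI both mappings are loaded from the repo.
    problems = validate_declarations(
        declared={"quarterly-board-pack": {"reports.read"}},
        scopes_used_by={"quarterly-board-pack": {"reports.read", "finance.read"}},
    )
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # failing the build is the point: drift is caught before deploy
```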
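The two checks are deliberately redundant: the CI gate keeps the declarations honest, and the cleanup guard trusts only what is declared, so a scope has to slip past both before the December 9 failure mode can recur.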

What this incident demonstrated about the architecture.

Three things.

Failure-state cleanliness paid for itself. The workflows failed without producing partial outputs because the runtime owned the failure semantics. We have written about why the workflow runtime needs to own failure-state explicitly; this incident is a textbook example of the architectural property earning its keep.

Cross-system audits depend on cross-system visibility. The SSO cleanup logic and the workflow-dependency declarations were in different systems. The cleanup logic was correct in its system; the gap was the absence of a check across systems. Most enterprise AI incidents we have investigated have this shape — the failure is at the intersection of two systems each of which is locally correct.

Quarterly workflows are an underweighted source of incident risk. The trailing-30-day audit failed because the workflow ran quarterly. Quarterly workflows are common in enterprise AI deployments — board-cycle workflows, audit-cycle workflows, planning-cycle workflows — and they are the workflows most likely to be invisible to the typical operational instrumentation. The dependency-audit discipline has to extend to them explicitly.

We publish this postmortem because the cross-system-audit gap is the most common category of self-inflicted enterprise AI incident in our experience. The fix is not heroic. It is a small additional check in the cleanup logic that consults the dependency declarations. The reason most deployments do not have this check is not that it is hard. It is that the gap is not visible until it produces an incident.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
