
Postmortem: how a vendor outage killed twelve workflows we owned.

A sixty-six-minute outage at a third-party API took down twelve of our workflows and cost us over two hours of customer-facing impact, because they all happened to depend on the same vendor for an unrelated step. Here is the audit that surfaced the dependency, and the rebalancing we did.

By D. Cho · STAFF ENGINEER · KNYTE
PUBLISHED: FEBRUARY 21, 2026
READ TIME: 11 MIN
CATEGORY: POSTMORTEMS

On February 4, 2026, a third-party API we depend on for a specific enrichment step went down for sixty-six minutes. The outage took down twelve of our workflows, and clearing the knock-on effects took two hours and fourteen minutes. The twelve workflows are not obviously related to each other: they cover support triage, deal hygiene, content drafting, internal-knowledge retrieval, and four other categories. Their common dependency was a credential-resolution step that all of them, for historical reasons, routed through the same third-party API. The outage exposed a shared dependency we had stopped seeing, because the workflows that inherited it had been added incrementally over fourteen months. We are publishing the postmortem because the dependency-accumulation pattern is more general than this specific vendor or this specific step.

Timeline.

  • 16:08 UTC — Third-party API begins returning 503 across all endpoints. Vendor's status page acknowledges the issue at 16:14 UTC.
  • 16:09 UTC — First workflow failure: support-triage. The workflow's authority-resolution step depends on the third-party API for credential validation.
  • 16:11 UTC — Workflow failure cascade begins. By 16:14 UTC, six workflows have failed.
  • 16:15 UTC — On-call engineer correlates the cluster of failures to the third-party API outage. Vendor status page confirms.
  • 16:18 UTC — Twelve workflows confirmed affected. Three workflows remain healthy because they do not share the dependency.
  • 16:22 UTC — Decision: do not attempt failover during the active incident. The third-party vendor's status page estimates 30-minute recovery. Risk of mid-failover state corruption exceeds the cost of the wait.
  • 16:48 UTC — Vendor's first recovery estimate slips. Failover plan reactivated.
  • 17:14 UTC — Third-party API returns to service. Workflows recover automatically as their next scheduled run completes.
  • 18:22 UTC — All twelve workflows confirmed healthy. Backlog cleared. End of customer-facing impact.
  • Total elapsed: 2 hours 14 minutes. Affected workflow runs: 84. Outputs the editor team had to clear from the queue manually: 11.

Root cause.

Twelve of our workflows depended on the same third-party API for a credential-resolution step. The dependency was not visible in any single workflow's design: each workflow's authority-resolution step called a shared utility function, which in turn called the third-party API. The dependency had been introduced eighteen months earlier, when the utility was created. As workflows were added over the following fourteen months, each one inherited the dependency without anyone explicitly choosing to add it.

When the third-party API went down, the twelve workflows failed simultaneously. There was no fallback path because the utility function had been written assuming the third-party API would be available. There was no dependency audit that would have surfaced the shared dependency because we had not been running one.
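
For readers who want the shape of the failure, the sketch below is a hypothetical reconstruction; the names, the endpoint, and the use of `requests` in place of our internal HTTP client are all invented. The structural point is that the vendor lives inside a shared helper, so no workflow definition mentions it.

```python
import requests  # stand-in for our internal HTTP client

# Invented endpoint; the real one belongs to the vendor in question.
VENDOR_URL = "https://vendor.example.com/v1/credentials/resolve"

def resolve_credentials(principal: str) -> dict:
    """Resolve and validate credentials for a workflow principal.

    Single API target, no fallback: a vendor outage becomes an
    exception in every caller, and nothing in the signature says so.
    """
    resp = requests.post(VENDOR_URL, json={"principal": principal}, timeout=5)
    resp.raise_for_status()  # a vendor 503 surfaces here, in all twelve workflows
    return resp.json()
```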

What worked.

Failure isolation. The twelve workflows failed cleanly. The runtime caught the API failure, marked each affected workflow run as failed-with-retry, and surfaced the failures to the operations team within seconds. No partial state was committed. No outputs shipped on stale data. The runtime's editor-in-the-loop default and the explicit failure-state semantics meant the failures were containable.
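
A minimal sketch of those failure-state semantics, with invented names (our runtime is internal): an external failure moves the run to an explicit failed-with-retry state, records a reason, and commits nothing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable

class ExternalDependencyError(Exception):
    """A step's third-party call failed."""

class RunState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED_WITH_RETRY = "failed_with_retry"  # explicit, surfaced to on-call

@dataclass
class WorkflowRun:
    state: RunState = RunState.RUNNING
    failure_reason: str | None = None
    committed_output: Any = None

def execute_step(run: WorkflowRun, step: Callable[[], Any]) -> None:
    """Run one step; on an external failure, park the run in an
    explicit failed state and commit nothing."""
    try:
        output = step()                    # may raise on a vendor 503
    except ExternalDependencyError as err:
        run.state = RunState.FAILED_WITH_RETRY
        run.failure_reason = str(err)      # visible to the operations team
        return                             # no partial state is committed
    run.committed_output = output          # committed only on success
    run.state = RunState.SUCCEEDED
```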

Rapid diagnosis. The on-call engineer correlated the cluster of failures to the shared API dependency in seven minutes. The trace data exposed the credential-resolution step as the common element across the twelve failure traces, and the third-party API endpoint was named explicitly in each trace.
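
The trace shape that made this possible looks roughly like the sketch below, again with hypothetical names: every external call records the endpoint it hit, so the common element across twelve failed traces is a string match, not an investigation.

```python
import time
from contextlib import contextmanager

@contextmanager
def external_call_span(trace: list, step: str, vendor_endpoint: str):
    """Record one external call, naming the third-party endpoint."""
    span = {
        "step": step,
        "external_dependency": vendor_endpoint,  # the field on-call correlates on
        "start": time.time(),
        "status": "ok",
    }
    try:
        yield span
    except Exception as err:
        span["status"] = f"error: {err}"
        raise
    finally:
        span["end"] = time.time()
        trace.append(span)
```

Filtering the failed traces on `external_dependency` produces the cluster of twelve in one pass.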

Decision discipline on failover. The decision not to attempt mid-incident failover was correct. The vendor's recovery time was bounded. The cost of a partial failover with split-state would have been substantially higher than the cost of waiting. The decision was made within fifteen minutes of the incident and held.

What did not work.

No dependency audit had been run. The shared dependency was discoverable; we had simply never run the discovery. Fourteen months of incremental workflow additions, on top of a utility written eighteen months earlier, accumulated into a single point of failure that we had stopped seeing, because none of the additions was individually controversial.

No fallback path on the credential-resolution utility. The utility had been written with a single API target. Adding a fallback would have been a half-day of engineering at the time it was written. Adding a fallback retroactively, after the dependency had spread to twelve workflows, was a multi-week project with much higher risk.

The vendor's status page was the only signal we had on outage progress. We had no internal probe against the third-party API that would have given us early warning. The first signal of the outage was the workflow failures themselves, which meant we were always at least one workflow run behind the actual outage start.
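
For scale, the probe we lacked is small. A sketch, with a hypothetical health endpoint and alert hook, and `requests` standing in for our internal client; the production version is change 3 in the next section.

```python
import time
import requests  # stand-in for our internal HTTP client

PROBE_INTERVAL_S = 60
FAILURE_THRESHOLD = 2  # alert after two consecutive failed probes

def probe_forever(health_url: str, alert) -> None:
    """Poll a vendor health endpoint so outage signal arrives
    independent of any workflow run."""
    consecutive_failures = 0
    while True:
        try:
            healthy = requests.get(health_url, timeout=5).status_code == 200
        except requests.RequestException:
            healthy = False
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures == FAILURE_THRESHOLD:
            alert(f"{health_url}: {FAILURE_THRESHOLD} consecutive probe failures")
        time.sleep(PROBE_INTERVAL_S)
```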

Architectural changes.

  1. Quarterly dependency audit. We now run an automated audit of the shared dependencies across all production workflows. The audit produces a graph of which third-party APIs are depended on by which workflows, with the count and the criticality. Shared dependencies above a threshold trigger an architecture review for whether a fallback should be built.
  2. Fallback path on the credential-resolution utility. The utility now supports two API targets, with automatic failover on 503/504 (a sketch of the shape follows this list). The fallback adds latency on the failover path; this is an acceptable cost for the failure isolation it provides.
  3. Active probes against critical third-party dependencies. We now run a 60-second probe against each third-party API that more than three workflows depend on. The probe surfaces outage signal independent of workflow runs, so we have early warning before the first workflow fails.
  4. Workflow-level dependency declarations. Each workflow definition now includes an explicit declaration of every third-party dependency it has, including transitive dependencies through utility functions. The declarations are validated against the dependency-audit graph; mismatches fail the workflow's CI.
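
A sketch of the failover shape in change 2, with invented URLs and `requests` standing in for our internal client: targets are tried in order, and only a 503 or 504 moves the call to the fallback. Other error statuses are treated as real request failures that a second vendor would only reproduce.

```python
import requests

TARGETS = [  # ordered: primary first, fallback second (URLs invented)
    "https://primary-vendor.example.com/v1/credentials/resolve",
    "https://fallback-vendor.example.com/v1/credentials/resolve",
]
FAILOVER_STATUSES = {503, 504}

def resolve_credentials(principal: str) -> dict:
    last_error: Exception | None = None
    for url in TARGETS:
        try:
            resp = requests.post(url, json={"principal": principal}, timeout=5)
        except requests.RequestException as err:
            last_error = err
            continue                 # network failure: try the next target
        if resp.status_code in FAILOVER_STATUSES:
            last_error = RuntimeError(f"{url} returned {resp.status_code}")
            continue                 # vendor-side outage: fail over
        resp.raise_for_status()      # other 4xx/5xx: a real error, not failover
        return resp.json()
    raise RuntimeError("all credential-resolution targets failed") from last_error
```

The extra round trip is paid only when the primary is down, which is the latency cost named above.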

What this incident demonstrated about the architecture.

Two architectural properties that compounded.

Failure isolation worked because the workflow runtime owned the failure semantics. A workflow runtime that defaulted to retrying failed steps with no escalation would have produced cascading failures and inconsistent state. The runtime's explicit failure-state model contained the blast radius to exactly the workflows whose dependency was down. We have written about why the workflow runtime has to own these semantics; this incident is a concrete example.

Trace data with explicit external-dependency annotations made diagnosis fast. Each workflow's trace included the named third-party API for every external call. The on-call engineer did not have to reverse-engineer the dependency from log timestamps; the trace data named it.

Implicit dependencies accumulate faster than the architecture team can track them by hand. This is the lesson we did not have before. The dependency audit is now part of the quarterly architecture review, with the same discipline as the eval suite and the versioned corpus reviews.
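
The audit's core query is small once every workflow declares its transitive dependencies (change 4 above); a sketch under that assumption, with an illustrative threshold:

```python
from collections import defaultdict

REVIEW_THRESHOLD = 3  # illustrative; shared deps above this trigger a review

def shared_dependencies(workflows: dict[str, set[str]]) -> dict[str, list[str]]:
    """Invert workflow -> third-party deps into dep -> dependent
    workflows, keeping only deps shared widely enough to review."""
    dependents: dict[str, list[str]] = defaultdict(list)
    for workflow, deps in workflows.items():
        for dep in deps:
            dependents[dep].append(workflow)
    return {dep: sorted(names) for dep, names in dependents.items()
            if len(names) > REVIEW_THRESHOLD}
```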

We publish this postmortem because the dependency-accumulation pattern is structural. Any organization adding workflows incrementally over months will accumulate shared dependencies that nobody chose explicitly. The audit discipline is the bounded fix. The first time we ran the new audit, post-incident, it surfaced two other shared dependencies we had not been aware of. Both have since been refactored.

D. Cho · STAFF ENGINEER · KNYTE

Built distributed retrieval at Pinterest and Databricks. Spends most days inside trace viewers and the rest writing about why your eval suite lies to you.
