Every AI deployment has a rollback plan. The plan looks defensible in the design review. The first time it actually runs in a production incident, somewhere in the third minute of execution, somebody on the operations call says "we didn't think about this." The rollback works in the end, but the recovery time is longer than the plan promised, and the operations team learns several specific things about the architecture that the design review had abstracted away. The next rollback is faster because the team learned. The cost of the learning is the incident the rollback was supposed to bound.
Rollbacks under load have a different shape than the planning meeting suggests. They have to handle in-flight requests, partial state in caches, traffic mirroring, and the reality of a human operations team executing under pressure. The rollback patterns that work under load have specific properties that the patterns presented in design reviews usually lack.
What follows is the rollback architecture we ship on every install, the failure modes we have eliminated by adopting it, and the specific operational practices that turn a five-minute rollback into a ninety-second rollback when an incident is in progress.
Property one: warm-keep the previous model.
The previous model version stays loaded in inference instances for at least one full release cycle after a promotion. The cost is the memory and compute footprint of running both models warm; the benefit is that the rollback is a routing change, not a deployment.
This is the most expensive of the rollback properties operationally — model footprints are large — and the one that separates rollbacks measured in seconds from rollbacks measured in minutes. Teams that cold-start the previous model when an incident requires rollback are committing themselves to whatever cold-start time the model has, which is typically two to ten minutes for a 70B model on production hardware. Two to ten minutes is the difference between bounded incident impact and significantly less bounded incident impact.
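The warm-keep rule can be sketched as a small registry that keeps exactly the current and the previous version loaded, evicting anything older on promotion. The names here (WarmModelRegistry, promote, warmReady) are illustrative assumptions, not the shipped implementation:

```typescript
// Illustrative sketch: a registry that keeps the previous version warm
// after every promotion, so a rollback never pays cold-start time.
type ModelVersion = string;

class WarmModelRegistry {
  private warm = new Set<ModelVersion>();
  private current?: ModelVersion;
  private previous?: ModelVersion;

  // Promotion loads the new version but deliberately keeps the old one warm.
  promote(next: ModelVersion) {
    this.warm.add(next);          // load into inference instances
    this.previous = this.current; // previous stays loaded for a release cycle
    this.current = next;
    // Evict anything older than the previous version.
    for (const v of this.warm) {
      if (v !== this.current && v !== this.previous) this.warm.delete(v);
    }
  }

  // A rollback target is valid only if it is still warm.
  warmReady(v: ModelVersion): boolean {
    return this.warm.has(v);
  }
}
```

The eviction rule is the whole policy: exactly two versions are ever warm, which bounds the memory footprint while guaranteeing the rollback target never needs a cold start.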
Property two: routing flip is atomic.
The rollback executes by flipping a routing config that all inference instances read. The flip is atomic — there is no partial state where some requests route to the new model and some to the old. The atomicity is enforced by reading the config from a strongly-consistent source (we use a small etcd cluster; teams using other primitives tend to land here too) and validating the read at every request boundary.
The non-atomic alternative of gradually shifting traffic from new to old sounds safer and is operationally worse in an incident. During the transition, output quality is non-deterministic from the user's perspective, the audit trail has to record the model version per request rather than per time window, and the operations team is debugging a system in two states. Atomic rollback fails one or two requests at the moment of the flip and produces a clean architectural state on either side.
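The flip mechanics can be sketched with the strongly-consistent store stood in by an in-memory object; the production version reads from etcd, and RoutingStore and routeRequest are hypothetical names used only for illustration:

```typescript
// Minimal sketch of an atomic routing flip. A single write replaces the
// whole config, and every request re-reads it, so the flip takes effect
// at the next request boundary with no partial state.
type ModelVersion = string;

interface RoutingConfig {
  target: ModelVersion;
  revision: number; // monotonically increasing, as etcd revisions are
}

class RoutingStore {
  private config: RoutingConfig = { target: "v2", revision: 1 };

  // Atomic flip: one write swaps the entire config object.
  flip(to: ModelVersion): RoutingConfig {
    this.config = { target: to, revision: this.config.revision + 1 };
    return this.config;
  }

  // Validated read at every request boundary; no instance caches a stale target.
  read(): RoutingConfig {
    return this.config;
  }
}

function routeRequest(store: RoutingStore): ModelVersion {
  // Before the flip every request sees the old target, after it the new
  // one -- never a mixture within a single request.
  return store.read().target;
}
```

The revision number is what makes the audit trail clean: a rollback is one event at one revision, rather than a per-request record of which side of a traffic gradient each request landed on.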
Property three: in-flight requests have defined behavior.
Requests that were issued against the new model and are still executing when the rollback fires must have defined behavior. The wrong default is to let them complete on the new model and then route subsequent requests to the old; this produces inconsistent output behavior across a transaction window and is hard to reason about during incident investigation.
The right default is to abort in-flight requests at the next checkpoint and re-issue them against the rolled-back model. This requires the workflow runtime to support checkpointed execution, which we ship by default. The cost is a small latency tax on the affected requests; the benefit is that every output post-rollback is from the same model, deterministically.
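The abort-and-re-issue default can be sketched as a checkpointed execution loop. RollbackSignal and runCheckpointed are hypothetical names, and the real workflow runtime is more involved, but the shape of the guarantee is the same:

```typescript
// Illustrative sketch of checkpointed execution: an in-flight request that
// observes a rollback signal at a checkpoint discards its partial output
// and re-issues itself against the rolled-back model.
type ModelVersion = string;

class RollbackSignal {
  private target?: ModelVersion;
  fire(to: ModelVersion) { this.target = to; }
  pending(): ModelVersion | undefined { return this.target; }
}

// Runs a request as a series of checkpointed steps. Between steps, the
// loop checks for a rollback signal; on signal, partial output from the
// old model is discarded and the request restarts on the rollback target.
function runCheckpointed(
  steps: Array<(model: ModelVersion) => string>,
  model: ModelVersion,
  signal: RollbackSignal,
): { model: ModelVersion; outputs: string[] } {
  const outputs: string[] = [];
  for (const step of steps) {
    const target = signal.pending();
    if (target !== undefined && target !== model) {
      // Abort at the checkpoint and re-issue, so every completed output
      // comes from a single model version.
      return runCheckpointed(steps, target, signal);
    }
    outputs.push(step(model));
  }
  return { model, outputs };
}
```

The latency tax the article mentions is visible here: the re-issued request repeats its earlier steps on the rollback target. What it buys is that no completed request ever mixes output from two model versions.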
// Rollback executes against a versioned routing config.
// `actor` identifies the operator or automation initiating the rollback.
async function rollback(toVersion: ModelVersion, reason: string, actor: string) {
  // 1. Validate the target version is warm and ready
  const ready = await modelRegistry.warmReady(toVersion);
  if (!ready) throw new TargetNotWarm(toVersion);

  // 2. Atomic config flip; readers see the new version on their next request
  await routing.flip({ to: toVersion, reason, by: actor });

  // 3. Signal in-flight requests to abort at their next checkpoint
  await runtime.signalRollback({ activeAfter: toVersion });

  // 4. Audit-log the rollback as a versioned event
  await audit.log({
    kind: "model.rollback",
    target: toVersion,
    reason,
    actor,
    affectedRequestCount: runtime.inFlightCount(),
  });
}

Property four: rollback is a regular drill, not an incident-only path.
The rollback path that has been exercised twenty times in non-incident drills works under incident pressure. The rollback path that has only ever been theoretical introduces failure modes the operations team discovers in real time. We run rollback drills monthly on a representative production workload — flip to the previous model, verify behavior, flip back. The drill catches regressions in the rollback path before an incident requires it.
The drills also calibrate the operations team. Knowing that the rollback completes in ninety seconds because you have done it twenty times is operationally different from believing it will complete in ninety seconds because the runbook says so. The first decision in an incident is whether to rollback, and that decision is faster when the team is calibrated.
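The drill itself is small enough to sketch. Here rollback and the verification probes are passed in as stand-ins for the production entry points, and the shape (flip, verify, flip back, record elapsed time) is the point:

```typescript
// Hypothetical drill harness: flip to the previous model, run verification
// probes against it, flip back, and record how long the round trip took
// so the team's ninety-second expectation stays calibrated.
type ModelVersion = string;

interface DrillResult {
  target: ModelVersion;
  probesPassed: number;
  probesFailed: number;
  elapsedMs: number;
}

async function runDrill(
  current: ModelVersion,
  previous: ModelVersion,
  rollback: (to: ModelVersion) => Promise<void>,
  probes: Array<(model: ModelVersion) => Promise<boolean>>,
): Promise<DrillResult> {
  const start = Date.now();
  await rollback(previous); // flip to the previous model
  let passed = 0;
  let failed = 0;
  for (const probe of probes) {
    (await probe(previous)) ? passed++ : failed++;
  }
  await rollback(current); // flip back
  return {
    target: previous,
    probesPassed: passed,
    probesFailed: failed,
    elapsedMs: Date.now() - start,
  };
}
```

Because the drill exercises the same rollback entry point an incident would, a regression in the warm-keep check, the routing flip, or the in-flight signaling surfaces in a monthly drill report instead of in the third minute of an outage.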
Where this connects to the broader runtime.
Rollback architecture depends on observability. Without the trace and audit-trail layers, the operations team cannot quickly determine whether a rollback is the right action — they are debugging blind. Without a versioned corpus and versioned model registry, the rollback target is ambiguous. Without editor-in-the-loop as the default, the consequences of a delayed rollback are larger because more bad outputs reach downstream systems.
The full architecture is one thing, not five things. Rollback is the property that becomes load-bearing at the worst possible moment. The four-property pattern is what survives the moment.
If your current rollback plan has not been drilled in the last quarter, it is theoretical. The first incident will turn it into a learning exercise. The cost of running the drill before the incident is measured in dollars; the cost of learning during the incident is measured in customer impact. The arithmetic favors the drill.
There is one more property worth noting: the four-property pattern is composable with the rest of the operational architecture. Warm-keep dovetails with the model registry. Atomic routing flips dovetail with the trace data already required for audit. In-flight request handling dovetails with the workflow runtime's checkpoint support. Drills dovetail with the on-call rotation. None of these are net-new investments for an organization that has already built the broader operational architecture; they are extensions that compose with what is already there. That composability is what makes the pattern shippable inside a quarter rather than a year.