
Designing AI features that survive a model swap.

Most AI features are over-fit to the model that was current when they shipped. Three months later, the model changes and the feature breaks in subtle, demoralizing ways. Here is the design pattern that survives.

By M. Okafor · HEAD OF PRODUCT · KNYTE
PUBLISHED: APRIL 19, 2026
READ TIME: 11 MIN
CATEGORY: PRODUCT

The most demoralizing meeting on a product calendar is the one held three months after an AI feature shipped, when the underlying model has been updated and nobody can quite explain why the feature is producing weirder outputs than it used to. The same prompts get different answers. The structured outputs the feature relied on come back malformed about three percent of the time. The eval suite passes, but the support tickets are climbing. The feature, in any practical sense, is broken — but in a way that is hard to articulate to leadership because nothing has technically failed.

This pattern is so consistent across product teams we have audited that we have started calling it the model-swap regression. It happens because most AI features were designed against a specific model's quirks, behaviors, and idiosyncrasies, even when the design team thought they were designing against capabilities. When the model changes, the design assumptions stop holding, and the feature becomes a slightly different feature that nobody intended to ship.

The features that survive a model swap have a different shape. They are designed against a contract — a specification of what the model is being asked to produce — rather than against the model itself. The contract is what the product owns. The model is what implements it. Swapping the model means re-implementing the contract; the user-visible behavior of the feature does not change.

What "the contract" actually is.

A feature contract has three parts. The input shape: what data the feature has access to and in what form. The output shape: what the feature produces and how downstream code consumes it. And the behavioral guarantees: which properties of the output the feature commits to maintaining regardless of which model implements them.

Behavioral guarantees are the part product teams most often skip. They are usually unstated, encoded only in the prompt's natural-language instructions, and discoverable only by inspecting the model's outputs. "Drafts must not include external links unless explicitly cited from the corpus." "Numerical claims must be sourced." "Outputs must be in the buyer's brand voice." These are the things that, when they break, make the feature feel different even if no specific output is wrong.

Writing the behavioral guarantees down — explicitly, in a contract document the product team owns — is the first half of designing for model swaps. Ensuring the eval suite tests every guarantee, not just the input/output shape, is the second half.
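A contract written down this way can live in code the product team owns. Here is a minimal sketch in Python; the schema contents and guarantee names are illustrative stand-ins, not any team's actual contract:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guarantee:
    """One behavioral guarantee, stated explicitly so the eval suite can test it."""
    id: str
    statement: str  # the natural-language commitment the product owns

@dataclass(frozen=True)
class FeatureContract:
    """The three parts of a feature contract: input shape, output shape, guarantees."""
    input_schema: dict   # JSON-schema-style description of what the feature consumes
    output_schema: dict  # JSON-schema-style description downstream code relies on
    guarantees: tuple    # tuple of Guarantee

# Illustrative contract for a drafting feature (all names hypothetical).
draft_contract = FeatureContract(
    input_schema={"type": "object",
                  "properties": {"corpus": {"type": "array"}}},
    output_schema={"type": "object",
                   "properties": {"draft": {"type": "string"}},
                   "required": ["draft"]},
    guarantees=(
        Guarantee("no-external-links",
                  "Drafts must not include external links unless cited from the corpus"),
        Guarantee("sourced-numbers",
                  "Numerical claims must be sourced"),
    ),
)
```

The point of the frozen dataclass is that the contract is a versioned artifact, not a mutable scratchpad: changing a guarantee is a deliberate contract change, reviewed like any other spec change.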

Three patterns that produce model-swap-fragile features.

After auditing a few dozen features that broke on model updates, we found that three patterns produce most of the fragility.

Tone tuning by prompt phrasing. The product team discovered that a particular phrasing of the prompt produces outputs in a tone the team likes. The phrasing is not a contract; it is a workaround. When the model changes, the phrasing produces a different tone, and the team is back to iterating on prompt phrasings against the new model.

Format tuning by example. The prompt includes an example of the desired output format. The model learns to mimic the example. When the model changes, the mimicry is less exact, and downstream parsing breaks. The example is not a contract; it is a hope.

Behavior tuning by negative instruction. The prompt includes "do not include X." The current model respects the instruction; the new model respects it about ninety percent of the time. The exception cases are the ones that surface as support tickets. The negative instruction is not a contract; it is a request.

All three patterns share a property: they encode the desired behavior in the prompt rather than in the contract. When the model changes, the prompt has to be re-tuned. The feature's user-visible behavior depends on prompt-tuning labor that has to happen on the model's update schedule, not on the product's release schedule.

What contract-first design looks like.

A contract-first feature defines the input schema, the output schema, and the behavioral guarantees up front, in code that is independent of any specific model. The model selection becomes an implementation detail. The prompt becomes a serialization concern. The eval suite tests every behavioral guarantee against the current model and every model under evaluation. A model is promotable if and only if it satisfies the contract.
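The promotion rule at the end of that paragraph reduces to a one-line predicate. A hedged sketch, assuming the eval suite exposes a per-guarantee pass/fail check; `eval_passes` here is a hypothetical hook, stubbed with a results table for illustration:

```python
def promotable(model_id: str, guarantee_ids: list, eval_passes) -> bool:
    """A model is promotable if and only if it satisfies every guarantee
    in the contract. eval_passes(model_id, guarantee_id) -> bool is
    backed by the eval suite in a real setup."""
    return all(eval_passes(model_id, g) for g in guarantee_ids)

# Stubbed eval results for illustration: model-b passes both guarantees,
# model-c fails one.
results = {
    ("model-b", "no-external-links"): True,
    ("model-b", "sourced-numbers"): True,
    ("model-c", "no-external-links"): True,
    ("model-c", "sourced-numbers"): False,
}
check = lambda m, g: results.get((m, g), False)
```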

The shift in posture is consequential. The product team is no longer responsible for the prompt — the engineering team owns prompt serialization for the contract. The product team is responsible for the contract: which behavioral guarantees the feature commits to, which input data it consumes, what output downstream code can rely on. The conversation between product and engineering shifts from "is this prompt good" to "is this contract sufficient."

How the contract gets enforced.

Three enforcement points. The output schema is enforced at the model API level — modern APIs support structured outputs and validate against a JSON schema before returning. Outputs that do not conform to the schema get retried or surfaced as errors. The behavioral guarantees are enforced by the eval suite, which we covered in the eval architecture dispatch. The input schema is enforced by the type system, so the feature gets a type error if the calling code passes the wrong shape.
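The API-level enforcement point can be sketched as a validate-then-retry wrapper around the model call. This is a minimal illustration, not any vendor's API; a production version would validate against the full JSON schema (for example with the `jsonschema` library) rather than just required keys:

```python
import json

def validate_output(raw: str, required_keys: set) -> dict:
    """Enforce the output schema at the boundary: parse, check required
    keys, and raise so the caller can retry or surface an error."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"output missing required keys: {missing}")
    return data

def call_with_retries(generate, required_keys: set, max_attempts: int = 3) -> dict:
    """generate() is the model call; retry when the output violates the schema."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return validate_output(generate(), required_keys)
        except ValueError as err:
            last_err = err
    raise RuntimeError(f"model never satisfied the output schema: {last_err}")
```

The key property is that schema violations surface as typed errors at one boundary, instead of leaking into downstream parsing as the three-percent malformation rate from the opening anecdote.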

The eval-suite enforcement is the one most teams underinvest in. Behavioral guarantees are not testable by similarity metrics. They require editorial rubrics — explicit per-guarantee judgments by humans or by calibrated model graders. Setting up the rubric for each guarantee is real work. It is also the work that converts the contract from a hopeful document into a load-bearing one.
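Some guarantees admit mechanical rubric checks; others genuinely need a human or a calibrated model grader. A sketch of the mechanical kind, using the external-links guarantee from earlier; the regex and function names are illustrative:

```python
import re

def no_external_links(output: str, cited_urls: set) -> bool:
    """Rubric check for the guarantee 'drafts must not include external
    links unless explicitly cited from the corpus': every URL in the
    output must appear in the set of corpus-cited URLs."""
    links = set(re.findall(r"https?://\S+", output))
    return links <= cited_urls

# Per-guarantee rubric registry. Guarantees that are not mechanically
# checkable would map to a calibrated grader call instead.
rubrics = {"no-external-links": no_external_links}
```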

What this looks like in a real product team.

A typical contract-first product team has a feature spec that reads more like an API spec than a typical product brief. The user-facing description is short. The contract section is detailed: input schema, output schema, behavioral guarantees enumerated, eval rubric per guarantee, escalation path when a guarantee fails. The natural-language prompt is generated from the contract and reviewed by engineering.

Iteration happens against the contract, not the prompt. "The drafts are too long" becomes "add a length-bound guarantee to the contract and a rubric to enforce it." The work to satisfy the new guarantee may include prompt changes, but the prompt change is downstream of the contract change. When the model is upgraded, the contract is unchanged. The prompt may be re-serialized for the new model, the eval suite is re-run, and the feature ships if and only if the contract is still satisfied.
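Generating the prompt from the contract is what keeps the prompt downstream of it. A minimal sketch of such a generator; the template wording is illustrative, and in practice the serialization would be tuned per model family by engineering:

```python
def serialize_prompt(task_description: str, guarantee_statements: list) -> str:
    """Generate the natural-language prompt from the contract, so prompt
    changes are always downstream of contract changes."""
    rules = "\n".join(f"- {statement}" for statement in guarantee_statements)
    return (
        f"{task_description}\n\n"
        f"Hard requirements:\n{rules}\n\n"
        f"Return JSON matching the output schema."
    )

prompt = serialize_prompt(
    "Draft a response using only the provided corpus.",
    ["Drafts must not include external links unless cited from the corpus",
     "Numerical claims must be sourced"],
)
```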

We run this pattern on every workflow on the Knyte automation page. The contracts are versioned, the rubrics are versioned, the prompts are generated from the contracts. When we promote a new model — which we do every few weeks — the user-facing features do not change unless the contract changed. The deployments that started this way have not produced a model-swap regression in the last six months.

Where to start.

Pick the highest-traffic AI feature in your product. Write its contract: input schema, output schema, behavioral guarantees as an enumerated list. Convert the existing prompt into a generator that takes the contract as input. Build an eval suite that tests every guarantee. Run the suite against the current model. Whatever number it produces is the new baseline.
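The baseline number from that last step can be computed directly. A sketch, assuming each eval case returns pass/fail; `eval_case` is a hypothetical hook into your suite, stubbed here for illustration:

```python
def baseline_pass_rate(model_id: str, cases: list, eval_case) -> float:
    """Run every guarantee eval case against the current model; the
    resulting pass rate is the baseline candidate models must meet."""
    passed = sum(1 for case in cases if eval_case(model_id, case))
    return passed / len(cases)

# Stubbed eval: pretend even-numbered cases pass for the current model.
rate = baseline_pass_rate("current-model", [0, 1, 2, 3], lambda m, c: c % 2 == 0)
```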

When the next model upgrade is announced, the question stops being "will our feature break." It becomes "does the new model satisfy the contract." The eval suite answers it. Either the new model is promoted or it is not. The product team's roadmap does not depend on a model vendor's release cadence. That is the durable property of contract-first design, and it is the difference between a feature that survives the next eighteen months and one that ages into the model-swap regression nobody can quite articulate.

M. Okafor · HEAD OF PRODUCT · KNYTE

Shipped the first multi-tenant editor-in-the-loop runtime at Notion. Now designs the surfaces operators actually use. Believes most AI products are toggles in search of a workflow.
