
The eval-trust gap: why AI deployments fail QBR before they fail production.

Most enterprise AI deployments are trusted by their users and distrusted by their boards. The gap is not a measurement problem; it is an evidence problem. Here is the evidence pack that closes it.

By A. Vasquez · Principal Thesis · Knyte
Published April 29, 2026 · 11 min read

An AI deployment can be working, with operators using it daily, output volume up, and editorial sign-off rates climbing, and still fail at the QBR. The failure is not technical. The deployment performs. The failure is evidentiary: the team running it cannot produce the kind of artifact a board can read, verify, and incorporate into a fiduciary decision. The operators trust the system because they see it work. The board distrusts it because they cannot see what the operators see, and the slides being shown to them substitute hand-waving for proof. We have started calling this the eval-trust gap, and it has killed more deployments in the cohorts we audit than any technical failure mode we have catalogued.

The gap is closeable. Closing it is not a different deployment; it is a different reporting layer on top of the same deployment. Three artifacts, in our experience, do most of the work — and most teams running AI deployments are not producing any of them with the discipline that lets a board treat them as evidence rather than as marketing.

Artifact one: the replaced-line-item table.

We covered this frame at length in the productivity-uplift dispatch. The table is a four-column document — workflow, prior cost, current cost, delta — where every row maps to a specific line item in the operating budget. The board can verify it because the finance team can confirm the deltas. There is no productivity number to interpret. The table is what it claims to be.
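For teams assembling the table by hand, a minimal sketch of its structure looks like the following: four columns, a computed delta, and every row tied to a named operating-budget line that finance can confirm. The workflow names, budget-line identifiers, and figures are invented for illustration, not drawn from any real deployment.

```python
# Sketch of a replaced-line-item table: workflow, prior cost, current cost, delta,
# with every row mapped to a specific operating-budget line.
# All names and figures below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class LineItem:
    workflow: str        # the workflow the deployment replaced or reshaped
    budget_line: str     # the operating-budget line finance can verify
    prior_cost: float    # annualized cost before the deployment (USD)
    current_cost: float  # annualized cost in the current reporting period (USD)

    @property
    def delta(self) -> float:
        return self.current_cost - self.prior_cost

rows = [
    LineItem("Tier-1 support triage", "OPEX-4310 Contract support", 820_000, 510_000),
    LineItem("Marketing localization", "OPEX-2140 Agency translation", 240_000, 90_000),
]

print(f"{'Workflow':<26}{'Budget line':<32}{'Prior':>12}{'Current':>12}{'Delta':>12}")
for r in rows:
    print(f"{r.workflow:<26}{r.budget_line:<32}{r.prior_cost:>12,.0f}"
          f"{r.current_cost:>12,.0f}{r.delta:>12,.0f}")
```

The point of the structure is that there is nothing to interpret: each delta is a number the finance team can trace back to a budget line it already owns.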

Most teams skip this artifact because building it requires sitting down with finance for two days and tagging operating-budget lines to workflows. Those two days are the cost of admission to a defensible AI program. Teams that have not paid it are running deployments that the board has every reason to be skeptical of.

Artifact two: the compounding-output curve.

The curve we wrote about here — output per team-member-month, indexed to month one, holding headcount constant. The board reads the shape of the curve, not the specific number. A curve that bends upward through month nine is evidence the deployment is genuinely architecture. A curve that plateaus at month four is evidence it is a feature. The board can interpret the shape without needing to understand the underlying technology, which is exactly what makes it useful as evidence.
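The arithmetic behind the curve is simple enough to sketch. The monthly output figures and headcount below are illustrative assumptions, not measurements from any deployment; the only thing that matters for the board is the shape of the indexed series.

```python
# Sketch of the compounding-output curve: output per team-member-month,
# indexed to month one, with headcount held constant across the window.
# Monthly figures and headcount are illustrative only.

monthly_output = [140, 152, 166, 171, 189, 204, 226, 241, 263]  # units of output
headcount = 12                                                   # constant by assumption

per_member_month = [m / headcount for m in monthly_output]
indexed = [round(v / per_member_month[0], 2) for v in per_member_month]

for month, value in enumerate(indexed, start=1):
    print(f"month {month}: {value}x of month one")

# A series still rising at month nine reads as architecture;
# a plateau around month four reads as a feature.
```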

Producing the curve requires the headcount and workflow definitions to be stable enough that the trend is interpretable. Many AI programs reorganize themselves so frequently that the curve is uninterpretable, which is itself a board-level signal — the deployment is not yet stable enough to evaluate, and the board should be told so explicitly.

Artifact three: the audit-grade incident log.

The artifact most teams forget. The incident log enumerates every editorial catch, every model rollback, every compliance-relevant event over the reporting period. The log is not a confession. It is evidence the editorial gate is doing what it claims to do. A board reading a clean incident log knows the deployment is being managed; a board presented with a deployment that has no incident log assumes either nothing is happening (unlikely) or the team is hiding it (more likely than the team realizes).

The incident log is the artifact that requires the audit-trail observability we wrote about separately. Without it, the log is reconstruction; with it, the log is generated from runtime telemetry and signed at the time the events occurred.
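One way to make "signed at the time the events occurred" concrete is to sign each entry as it is written and chain the signatures so that a missing or rewritten entry is detectable. The sketch below uses an HMAC over each entry; the event fields, key handling, and helper name are illustrative assumptions, not a prescription for any particular telemetry stack.

```python
# Sketch of an audit-grade incident log entry: generated at event time from
# runtime telemetry and signed so the record cannot be silently rewritten later.
# Field names, key handling, and the helper below are illustrative assumptions.

import hashlib
import hmac
import json
from datetime import datetime, timezone

SIGNING_KEY = b"replace-with-a-key-held-by-your-audit-infrastructure"

def record_incident(event_type: str, detail: str, prev_signature: str = "") -> dict:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,          # e.g. editorial catch, model rollback
        "detail": detail,
        "prev_signature": prev_signature,  # chaining makes gaps in the log visible
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return entry

first = record_incident("editorial_catch",
                        "Reviewer rejected a generated claim before publish")
second = record_incident("model_rollback",
                         "Rolled back to prior checkpoint after drift alert",
                         prev_signature=first["signature"])
print(json.dumps([first, second], indent=2))
```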

What the gap actually costs.

The eval-trust gap has three downstream costs that compound.

Renewal friction. A board that distrusts the deployment delays the renewal conversation, which slows the architectural investment, which slows the compounding curve. The deployment underperforms what it could have been because it was being defended rather than expanded.

Talent attrition. Operators who can see the deployment work are demoralized when they sense the board does not. The most engaged users — the ones whose editorial signal is most valuable to the compounding curve — are the first to disengage when the program is perceived as unstable. The compounding curve loses the calibration it most depended on. The team that built the program also begins to look for roles where the program they own is not under perpetual defense. Both attritions compound the eval-trust gap they originate from.

Procurement reset. When the gap is not closed, the next CIO or CFO will treat the program as a sunk cost and start over with a new vendor. The corpus, the workflows, the editorial signal — everything the deployment had built — is reset. The eighteen-month half-life we wrote about here is sometimes the vendor's fault and sometimes the buyer's evidence problem.

Three diagnostics for whether your deployment has the gap.

  1. If the QBR slide deck includes a productivity-uplift number that is not tied to specific operating-budget lines, the gap is open. Replace the number with a line-item table.
  2. If the deployment cannot produce a compounding-output curve at the workflow level because headcount or workflow definitions have shifted, the gap is open. Stabilize the program for at least one quarter so the curve becomes interpretable.
  3. If the most recent material editorial catch or model rollback is not in a written incident log that the board has seen, the gap is open. Generate the log from your audit-trail telemetry and ship it next QBR.

We see deployments cross the trust threshold within one QBR cycle of producing the three artifacts in the right form. The technical work is mostly there already. The reporting layer is what is missing. Closing the gap is the difference between a deployment that compounds and a deployment that is being defended at every renewal.

A. Vasquez · Principal Thesis · Knyte

Former CFO at three growth-stage SaaS companies. Writes the replacement-math frame the Knyte team uses on every architecture call. Stanford GSB; CPA.
