THESIS

Compounding output is the only AI ROI metric that survives an audit.

Engagement metrics, productivity uplift, and adoption rates all flatter the deployment. Compounding output is the one number that breaks if the deployment is not actually working. Here is how to measure it.

By A. Vasquez · PRINCIPAL THESIS · KNYTE
PUBLISHED: FEBRUARY 11, 2026
READ TIME: 12 MIN
CATEGORY: THESIS

Most AI metrics flatter the deployment. Engagement looks healthy because the team uses the tool every day. Adoption looks healthy because the seat count has grown. Productivity uplift looks healthy because the survey numbers came back optimistic. The deployment can post green metrics in every category and still be quietly failing — failing in the specific sense that month nine produces the same output as month one with the same headcount.

Compounding output is the metric that breaks if the deployment is not actually working. It is the ratio of useful output volume in month N to month one, holding headcount constant. If the deployment is genuinely architecture, the curve bends upward as queryable memory grows, editor-in-the-loop corrections accumulate, and the model fine-tunes converge on the buyer's specific judgment. If the deployment is a feature, the curve plateaus around month four and stays there. The shape of the curve is diagnostic in a way none of the engagement metrics are.
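A minimal sketch of the arithmetic, assuming a per-month count of editor-marked useful outputs (the function and its inputs are ours, purely illustrative):

```python
def compounding_ratio(useful_by_month: dict[int, int]) -> dict[int, float]:
    """Ratio of each month's useful output to month one, with headcount
    held constant (see the definitions below). Month keys start at 1."""
    base = useful_by_month[1]
    return {m: count / base for m, count in sorted(useful_by_month.items())}

# A curve that plateaus around month four:
print(compounding_ratio({1: 40, 2: 55, 3: 70, 4: 80, 5: 81, 6: 79}))
# {1: 1.0, 2: 1.375, 3: 1.75, 4: 2.0, 5: 2.025, 6: 1.975}
```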

We use compounding output as the headline metric on every install we ship. Not because it is the most flattering — sometimes it is brutally honest — but because it is the only metric that survives a board-level audit when somebody asks the question that matters: are we building a capability, or are we paying a subscription that pretends to be one?

How to measure it without lying to yourself.

Three definitions have to be nailed down before the metric is meaningful. Skipping any of them produces a number that looks like compounding output but functions like engagement.

01. Useful output, defined per workflow.

Useful output is workflow-specific. For a content workflow, it is editorial-approved drafts. For a support workflow, it is resolved tickets. For a renewal workflow, it is briefs that were used by the AE in the renewal call. The definition has to be specific enough that an editor or operator can mark each output as useful or not. If you cannot mark it, you cannot count it.

Most measurement frameworks fail here. They count outputs generated, not outputs used. The gap between those two numbers is the lie. We have audited deployments where the output-generated curve compounded beautifully and the output-used curve was flat. The deployment was producing more, and none of it was being shipped.
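A sketch of the bookkeeping that keeps the two curves separate, assuming each month carries both a generated count and an editor-marked-useful count (the record and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class WorkflowMonth:
    month: int
    generated: int      # everything the deployment produced
    marked_useful: int  # outputs an editor or operator signed off on

def used_vs_generated(rows: list[WorkflowMonth]) -> None:
    """Print both curves against their month-one baselines. Only the
    marked-useful curve counts toward compounding output."""
    rows = sorted(rows, key=lambda r: r.month)
    g0, u0 = rows[0].generated, rows[0].marked_useful
    for r in rows:
        print(f"month {r.month}: generated x{r.generated / g0:.2f}, "
              f"used x{r.marked_useful / u0:.2f}")
```

If the generated multiple keeps climbing while the used multiple sits near 1.0, you are looking at the gap described above.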

02. Headcount, held constant by team.

Compounding is meaningful only if the team is not growing into the curve. If the workflow's output doubled and the team running it also doubled, the deployment did not compound. It scaled, which is a different and less interesting result. The metric requires fixing headcount at the team level — not the company level, since reorganizations will obscure the signal — and tracking output per team-member-month.
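The normalization itself is one line, sketched here under the same illustrative naming (headcount is a float so partial allocations can be represented):

```python
def output_per_member_month(useful: dict[int, int],
                            headcount: dict[int, float]) -> dict[int, float]:
    """Useful output per team-member-month for the team running the
    workflow. If the team grew into the curve, this flattens it:
    scaling is not compounding."""
    return {m: useful[m] / headcount[m] for m in sorted(useful)}
```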

03. The month-one baseline must be honest.

The most common failure mode is anchoring the curve on a deliberately weak baseline. Month one looks bad because the team is learning the tool, the corpus is sparse, and the workflows are not yet wired. Month three looks great by comparison. The 3× compounding number that goes into the QBR deck is real arithmetic against an artificially low denominator.

We use a hardened baseline: the buyer's pre-deployment output for the same workflow over the trailing six months, headcount-adjusted. That baseline is harder to beat. The compounding numbers are smaller. They are also defensible.
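A sketch of that denominator, assuming you can pull six months of pre-deployment output counts for the same workflow (function and parameter names are ours):

```python
def hardened_baseline(pre_deploy_monthly: list[int],
                      pre_deploy_headcount: float) -> float:
    """Trailing-six-month pre-deployment output for the same workflow,
    headcount-adjusted to a per-team-member-month rate."""
    last_six = pre_deploy_monthly[-6:]
    return sum(last_six) / len(last_six) / pre_deploy_headcount

def ratio_vs_baseline(month_n_useful: int, month_n_headcount: float,
                      baseline: float) -> float:
    """Month-n per-member rate over the hardened denominator."""
    return (month_n_useful / month_n_headcount) / baseline
```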

What the curve shape tells you.

The shape of the compounding curve is itself a diagnostic. We have learned to read three patterns; a rough detection sketch follows all three.

Bent upward. The classical compounding signal. Output volume rises faster than linearly through month six, then plateaus around month nine as the workflow saturates and the next workflow becomes the bottleneck. This is what an architecture install looks like when it is working.

Flat after month four. The deployment is generating useful output but is not learning. The corpus is indexed but is not being added to. The editor corrections are not being folded back into the model. Either the architecture lacks a feedback loop, or the loop exists but is not being used. This is the most common failure mode in feature-led deployments.

Bumpy with periodic crashes. The deployment depends on a vendor model that gets updated on a schedule the buyer does not control. Each model update produces a regression that the editor team has to re-correct. The curve oscillates around a long-term flat trend. This is a structural sign that the weights are not tenant-owned.
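A rough detector for the three shapes, with thresholds that are illustrative rather than calibrated (the input is the monthly ratio series against the hardened baseline):

```python
import statistics

def read_curve(ratios: list[float]) -> str:
    """Classify a compounding curve into the three patterns above.
    ratios[i] is month i+1 against the hardened baseline."""
    if len(ratios) < 6:
        return "too early to read"
    # Repeated month-over-month drops of 15%+ suggest regressions from
    # vendor model updates the buyer does not control.
    drops = sum(1 for a, b in zip(ratios, ratios[1:]) if b < 0.85 * a)
    if drops >= 2:
        return "bumpy with periodic crashes"
    growth = statistics.mean(ratios[-3:]) / statistics.mean(ratios[:3])
    if growth > 1.5:
        return "bent upward"
    return "flat after month four"
```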

Why this metric is audit-defensible.

Compounding output, defined this way, has the property a CFO actually wants from an AI metric: it cannot be gamed without the gaming being visible. Engagement can be inflated by mandating tool use. Adoption can be inflated by re-licensing seats. Productivity uplift can be inflated by surveying optimistic users. Compounding output requires actual outputs being marked as useful by editors who are willing to defend their marks in an audit. The labor of inflating it is most of the labor of producing the real thing.

The audit defense, when it comes, is straightforward. The CFO can produce the workflow-by-workflow output count over time. The editor lead can produce the marked-useful percentage. The HR lead can confirm headcount stayed within the defined band. The ratio falls out of the data without further interpretation. Either the numbers compound or they do not. The board conversation is short.

We wrote about why productivity uplift is a category error because the failure mode of the older metric was so consistent across the audits we ran. Compounding output is what we replaced it with. It is not the only metric we report — the Knyte automation page describes the others — but it is the headline, and it is the one we will defend in front of a board that asks the question CFOs actually ask.

How to start measuring it now.

If you have an existing deployment and have not been measuring compounding output, the cleanest way to start is to pick the deepest workflow in the deployment, define useful output for that workflow specifically, set the baseline as the trailing six months of pre-deployment output, fix headcount, and start counting from this month forward. You will not have a useful curve until month four. By month six you will know whether the deployment is compounding.
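Putting the sketches together on hypothetical numbers (this reuses hardened_baseline and read_curve from above; nothing here comes from a real install):

```python
# Six months of pre-deployment output for the workflow, 5.0 FTEs.
baseline = hardened_baseline([60, 58, 66, 61, 63, 70],
                             pre_deploy_headcount=5.0)

# Useful output per team-member-month, months one through six.
per_member = {1: 11.5, 2: 13.0, 3: 16.5, 4: 21.0, 5: 27.5, 6: 35.0}
ratios = [per_member[m] / baseline for m in sorted(per_member)]
print(read_curve(ratios))  # -> "bent upward"
```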

If the curve plateaus, you have learned something the engagement metrics would have hidden for another six months. That is the value of the discipline. The metric does not flatter the deployment. It tells you what is happening.

A. Vasquez · PRINCIPAL THESIS · KNYTE

Former CFO at three growth-stage SaaS companies. Writes the replacement-math frame the Knyte team uses on every architecture call. Stanford GSB; CPA.
