
Inference cost as a first-class engineering concern.

Inference cost is the operational metric that has the largest impact on AI deployment economics and the smallest amount of engineering attention. Here is the discipline that turns it into a first-class concern.

By D. Cho · Staff Engineer, KNYTE
PUBLISHED: April 26, 2026
READ TIME: 12 min
CATEGORY: Engineering

Inference cost is the operational metric with the largest impact on AI deployment economics and the smallest amount of engineering attention in most teams we audit. The vendor's published per-token price gets quoted in the procurement deck. The actual cost in production gets measured at the line-item level by finance about a quarter after it surprises everyone. The engineering team that built the workflow rarely owns the cost dimension because the dimension was never a first-class engineering concern.

Treating inference cost as a first-class concern comes down to a small set of architectural and process changes. None of them are heroic. The aggregate effect, in the deployments where we have run this discipline, is a sustained twenty to forty percent reduction in inference cost without any visible quality regression. The reason most teams have not done this work is not that it is hard. It is that nobody on the team owned the question.

Three reasons inference cost surprises teams.

Token economics are not linear in workflow volume. A workflow that processes ten units of work uses more than ten times the tokens of a workflow that processes one unit, because retrieval contexts grow, prompt engineering accumulates, and output verbosity tends to inflate as the team optimizes for accuracy. The per-unit cost rises silently as the workflow matures.

Eval traffic is not budgeted. Eval suites that run on every model promotion produce real inference traffic. Traffic-replay tooling for incident investigation produces real inference traffic. Background corpus refreshes produce real inference traffic. None of this is in the cost model the team built around the production workflow itself. By month nine the eval-and-ops traffic is often a third of the total inference cost.

Failure modes default to retry. A workflow that fails partway through a multi-step generation almost always retries from the beginning, producing duplicate inference cost for every step prior to the failure. Most teams have not measured the duplicate-inference cost as a separate line. In our audits it is consistently 8–14% of total inference, and it is usually fixable with idempotent step boundaries.

What "first-class" actually means in code.

Three concrete properties.

Every model call captures cost as a trace dimension. The same trace span we wrote about in the observability dispatch carries `costUsd` alongside `latencyMs` and `outputHash`. The cost is queryable. Reports are immediate. The team can answer "what does this workflow cost per run" without a finance request.

Per-workflow cost budgets are enforced at the runtime level. Each workflow declares a per-run cost budget. The runtime tracks actual spend against the budget over a rolling window and alerts on drift; a sketch of that check follows the trace example below. Workflows whose cost has crept past the budget surface in the engineering review without anyone having to chase the question.

Eval and ops traffic are line items, not background noise. Every internal traffic source — eval runs, replay sessions, corpus refreshes, regression tests — is tagged at trace time. Reports separate production from internal traffic. The team can answer "which fraction of last month's bill was production" without ambiguity.

// Trace span with cost as a first-class dimension
const span = trace.start({
  workflowId: "wf-014",
  step: "draft",
  modelVersion: "tenant.brand-voice.v3.2",
  trafficSource: "production",  // vs "eval", "replay", "ops"
});
const result = await model.generate(prompt, schema);
span.end({
  costUsd: result.usage.costUsd,
  latencyMs: result.timing.totalMs,
  inputTokens: result.usage.inputTokens,
  outputTokens: result.usage.outputTokens,
});
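
A minimal sketch of the budget check, assuming a trace store that can be queried by workflow and traffic source. The `traceStore` and `alerts` interfaces and the budget value are illustrative, not any particular runtime's API.

// Per-workflow cost budget check over a rolling window (names are illustrative)
const budgetsUsdPerRun = {
  "wf-014": 0.12,  // declared per-run budget; the value here is made up
};

async function checkCostBudget(workflowId, windowDays = 7) {
  // Assumes the trace store can return spans filtered by workflow and traffic source
  const spans = await traceStore.query({
    workflowId,
    trafficSource: "production",
    since: Date.now() - windowDays * 24 * 60 * 60 * 1000,
  });
  const runs = new Set(spans.map((s) => s.runId)).size;
  const spendUsd = spans.reduce((sum, s) => sum + s.costUsd, 0);
  const actualPerRun = runs > 0 ? spendUsd / runs : 0;

  const budget = budgetsUsdPerRun[workflowId];
  if (budget !== undefined && actualPerRun > budget) {
    // The alert sink is whatever the team already uses for engineering review
    alerts.emit({
      kind: "cost-budget-drift",
      workflowId,
      budgetUsdPerRun: budget,
      actualUsdPerRun: actualPerRun,
    });
  }
}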

Three optimizations that consistently pay back.

Cache the deterministic. Many AI workflows include steps that are deterministic given their input — schema validation, classification routing, light summarization. Memoizing these against an input hash eliminates the inference cost without any quality risk. We typically see 15–25% cost reduction on workflows that previously called the model for every step.
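
A minimal sketch of the memoization, keyed on a hash of the step input plus the model version so a model change invalidates the cache. The `cache`, `buildClassifyPrompt`, and `classifySchema` names are illustrative stand-ins for whatever the workflow already has.

// Memoize a deterministic step against a hash of its input (helpers are illustrative)
import { createHash } from "node:crypto";

function inputHash(input) {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

async function classifyWithCache(input, modelVersion) {
  const key = `classify:${modelVersion}:${inputHash(input)}`;
  const cached = await cache.get(key);   // assumed KV store (Redis, DynamoDB, ...)
  if (cached) return cached;             // cache hit: no inference cost

  const result = await model.generate(buildClassifyPrompt(input), classifySchema);
  await cache.set(key, result, { ttlSeconds: 7 * 24 * 3600 });
  return result;
}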

Right-size the model per step. A workflow does not need the largest model for every step. The retrieval-rerank step often runs fine on a 7B model that costs a tenth of the production-grade 70B. The classification step often runs fine on a smaller fine-tune. Per-step model selection requires the workflow runtime to support multiple model targets, which most modern runtimes do; the engineering work is to actually use the capability.
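
One way to express the selection is a per-step model target in the workflow definition. The smaller model IDs and the `models.resolve` call are illustrative; the production model ID matches the trace example above.

// Per-step model targets declared in the workflow definition (small-model IDs are illustrative)
const workflow = {
  id: "wf-014",
  steps: [
    { name: "rerank",   model: "reranker-small-7b" },
    { name: "classify", model: "router-finetune-1b" },
    { name: "draft",    model: "tenant.brand-voice.v3.2" },  // the production-grade model
  ],
};

async function runStep(step, input) {
  // The runtime resolves a model per step instead of falling back to one default
  const target = models.resolve(step.model);   // assumed runtime API
  return target.generate(buildPrompt(step, input), schemaFor(step));
}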

Idempotent step boundaries. Workflows whose steps can be re-run independently, without re-running the prior steps, eliminate the failure-retry cost class. The work is to make the step inputs and outputs explicit and storable, which has the side benefit of making the workflow observable in traces at finer granularity.
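
A minimal sketch of the resume behavior, assuming each step's output is persisted under a deterministic key. The `stepStore` is an illustrative durable store, and `runStep` is the per-step call from the previous sketch.

// Resume a failed run from the last completed step instead of retrying from the start
async function runWorkflow(runId, steps, input) {
  let current = input;
  for (const step of steps) {
    const key = `${runId}:${step.name}`;
    const saved = await stepStore.get(key);
    if (saved !== undefined) {
      current = saved;                        // step already completed: no repeat inference
      continue;
    }
    current = await runStep(step, current);   // only unfinished steps pay inference cost
    await stepStore.put(key, current);        // explicit, storable step output
  }
  return current;
}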

Where this fits in the broader architecture.

Inference cost as a first-class concern is one piece of the broader observability architecture. The trace data already required for the audit-trail layer carries the cost dimension as a small extension. The workflow runtime already enforces editorial gates and policy checks; adding cost-budget enforcement is a small extension. The eval suite already runs on a defined cadence; tagging its traffic so the cost is attributable is a small extension.

The reason this pattern is not universal is that none of the extensions are individually urgent. The CFO is not yet asking. The board is not yet asking. The engineering team is busy. The cost compounds anyway. The teams that have adopted the discipline are the ones whose deployment economics will hold up when the conversation does start.

If you are running an AI deployment without per-workflow cost telemetry today, the first sprint of work is bounded — typically two weeks to add cost to the trace span, build the per-workflow cost report, and tag internal traffic sources. The optimizations follow naturally from the visibility. The visibility is what most deployments are missing, and what makes the cost question feel intractable when finance asks it.

The teams that have run this discipline for two quarters or more report a second-order benefit that is harder to attribute and easier to value than the cost reduction itself. The engineering culture around AI workflows shifts. Cost becomes part of the design conversation rather than an emergency that surfaces in finance review. Trade-offs between latency, quality, and cost get made deliberately rather than by default. The deployment economics stop being the kind of thing the team has to defend at the next QBR; they become the kind of thing the team can plan against. That cultural shift is the real return on the visibility investment, and it compounds over years in a way no single optimization does.

D. Cho · Staff Engineer, KNYTE

Built distributed retrieval at Pinterest and Databricks. Spends most days inside trace viewers and the rest writing about why your eval suite lies to you.
