ENGINEERING

Streaming model outputs without losing the editorial gate.

Streaming makes the AI feel fast. It also makes editor-in-the-loop architecturally awkward, because the editor is being asked to review an output that is still being produced. Here is the pattern that preserves both.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED APRIL 23, 2026
READ TIME 10 MIN
CATEGORY ENGINEERING

Streaming model outputs makes AI feel fast. The user sees tokens appearing immediately, the latency-to-first-token is in the hundreds of milliseconds rather than seconds, and the perceived responsiveness of the system is materially better. Streaming is the right default for almost every AI surface where a user is waiting for an output. It is also architecturally awkward when paired with editor-in-the-loop, because the editor is being asked to review an output that is still being produced. Most teams choose one or the other, and lose something either way.

There is a pattern that preserves both, and it is more decoupled than what most teams build. Streaming is a UX concern. The editorial gate is a system concern. They live at different layers, talk to each other through a defined contract, and produce the experience users want without compromising the controls operators need.

Why the naive integration fails.

The intuitive design is to stream tokens to the editor's surface and let the editor accept or reject as the output completes. The intuition fails for three reasons.

Editorial review needs the full output. An editor cannot meaningfully evaluate a draft that is still being produced. The first half of an output may look fine; the second half may invalidate it. Editor decisions made on partial outputs are systematically lower quality than editor decisions made on completed outputs.

Streaming and validation conflict. Schema-first prompts (we wrote about these here) produce structured outputs that can only be validated once complete. Streaming a structured output to the user creates a UI that displays partial JSON, which is unhelpful at best and misleading at worst.

Editor commitment happens before completion. If the editor accepts a partial output, the system has accepted a commitment on behalf of an output that has not yet been generated. The accepted output may or may not match what the editor saw. This is a category of inconsistency that is hard to debug and worse to defend.

The decoupled pattern.

Streaming and editorial review run on different layers. The streaming surface shows tokens to the user as a preview — an experiential affordance, not a commitment. The editorial gate runs on the completed output, with the user's preview functioning as a thumbnail that the editor can dismiss or expand. The contract between the layers is explicit: the streamed preview is informational; the editorial decision is on the completed artifact.
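
The contract between the two layers can be made explicit in the type system. A minimal sketch, assuming illustrative names (`PreviewEvent`, `ValidatedArtifact`, `isCommittable`) that are not from any specific SDK:

```typescript
// The streaming layer emits preview events: informational only, never a commitment.
type PreviewEvent = {
  kind: "preview";
  token: string;
};

// The editorial gate only ever sees a completed, schema-validated artifact.
type ValidatedArtifact = {
  kind: "artifact";
  output: unknown;       // the assembled output that passed validation
  schemaVersion: string; // which schema it validated against
  completedAt: number;   // epoch ms when assembly finished
};

// Only an artifact can carry commitment semantics; a preview never can.
function isCommittable(
  e: PreviewEvent | ValidatedArtifact,
): e is ValidatedArtifact {
  return e.kind === "artifact";
}
```

Encoding the distinction as a discriminated union means a code path that tries to commit a preview fails at compile time, not in review.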

Three implementation properties.

The stream is read-only at the UX layer. The user sees tokens accumulate. The user cannot accept the output during the stream. The accept affordance unlocks only after the output is complete and the schema validates. The visual cue that the affordance has unlocked is itself part of the editorial UX — the user knows when the system is ready for a decision.
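
The unlock rule is small enough to state directly. A sketch of the derived state, assuming a simple two-flag state shape (the names are illustrative):

```typescript
// UX-layer state: both flags are owned by the system, not the user.
type StreamState = {
  streaming: boolean; // tokens still arriving
  validated: boolean; // schema validation passed on the assembled output
};

// The accept affordance is derived state: read-only during the stream,
// unlocked only on a validated completion.
function acceptEnabled(state: StreamState): boolean {
  return !state.streaming && state.validated;
}
```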

Validation runs on completion, not on tokens. The schema validator does not see partial outputs. It runs against the assembled completion, fails fast on invalid outputs, and triggers a regeneration before any editorial decision is requested. Invalid outputs never reach the editor.

Editorial commitment is logged against the validated artifact. The audit trail records the completed, validated artifact and the editor's decision against it. The streaming preview is not in the audit trail because the streaming preview is not the artifact.

// Stream to UX, validate-and-gate to editor
const stream = await model.generate({ prompt, schema });

// UX: render tokens as they arrive
for await (const token of stream.tokens) {
  ui.appendPreview(token);
}

// Validation: run against the assembled output
const output = await stream.complete();
const validation = schema.validate(output);
if (!validation.ok) {
  return regenerate({ reason: validation.error });
}

// Editorial gate: requested only on the validated artifact
const decision = await editor.review(output);
await audit.log({ workflowId, output, decision });
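
One way to make the audit entry self-verifying is to bind the decision to a content hash of the validated artifact, so a later reader can confirm the editor reviewed exactly this output. A sketch; the field names are assumptions, not a specific audit API:

```typescript
import { createHash } from "node:crypto";

type AuditEntry = {
  workflowId: string;
  artifactHash: string; // hash of the validated artifact, never the preview
  decision: "accepted" | "rejected";
  at: string;           // ISO timestamp
};

function auditEntry(
  workflowId: string,
  artifact: string, // serialized validated output
  decision: "accepted" | "rejected",
): AuditEntry {
  return {
    workflowId,
    artifactHash: createHash("sha256").update(artifact).digest("hex"),
    decision,
    at: new Date().toISOString(),
  };
}
```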

What this preserves.

The user gets streaming. Latency-to-first-token is fast. Perceived responsiveness is high. The user is not waiting at a spinner.

The editor gets the full artifact. Decisions are made on completed outputs against schema-validated artifacts. The editorial signal is high-quality because the inputs to it are not partial.

The audit trail is unambiguous. The committed artifact is the validated one. The streaming preview is a UX affordance with no commitment semantics.

Where the pattern breaks down.

Two specific cases where the decoupled pattern is not the right answer.

Truly conversational interfaces. A back-and-forth chat where the user is iterating on the model's output in real time is not editor-in-the-loop in the sense we mean. The user is the editor, the conversation is the surface, and the streaming is the medium of the editorial signal. The decoupled pattern over-engineers this case.

Outputs without an emit path. If the workflow generates an output that is shown to the user and never emitted to a downstream system, there is no commitment to gate. The streaming UX is the entire interaction. This is most internal-facing summarization and ad-hoc retrieval, where the chat-sidebar replacement pattern we wrote about applies in a softer form.

For the bulk of enterprise AI workflows — anything that emits to a CRM, sends an email, updates a record, generates a contract amendment — the decoupled pattern is the right architecture. It produces fast UX and durable editorial control, which is the combination most teams have been trying to achieve while settling for one or the other.

The pattern also has a quieter benefit: it makes the system easier to evolve. The streaming layer can be optimized for perceived latency without affecting the editorial gate's correctness guarantees. The editorial gate can add new validation rules without changing the streaming UX. The two layers improve independently.

In a tightly coupled architecture, every change has to consider both concerns simultaneously, which is the kind of integration tax that slows engineering velocity in ways that are hard to attribute back to the architecture itself. The decoupling is what makes the system maintainable across the year of changes that follow the initial ship. Engineering teams that have adopted this split report that their model-promotion cadence increases as a side effect, because the streaming layer rarely needs to change at promotion time and the editorial gate's validation logic is the only thing that needs revisiting against a new model.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
