
Building observable agents: traces, replay, and the audit trail you owe regulators.

Most agent runtimes log enough to debug a single failure and not enough to defend an audit. Here is the observability surface every production agent system should ship with.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED: JANUARY 07, 2026
READ TIME: 14 MIN
CATEGORY: ENGINEERING

There are three different observability artifacts a production agent runtime needs to produce, and most runtimes confuse them. There are logs, which are good for diagnosing the immediate failure of a specific run. There are traces, which are good for understanding how a decision was reached and what would have happened if any input had been different. And there are audit trails, which are good for telling a regulator, a customer, or a board how a decision the system made nine months ago happened in a way that satisfies their need to know.

Most agent runtimes ship with logs. Some ship with traces. Few ship with audit trails. The teams running these systems discover the gap when an incident requires more than logs can answer, or when a buyer review asks the kind of question logs are structurally incapable of answering. By that point the deployment is too far along to retrofit the missing instrumentation cleanly.

What follows is the three-layer observability architecture we ship on every install, the failure modes each layer is designed to surface, and the specific incidents this architecture has paid for itself on.

Layer 1: Logs.

Logs are unstructured-or-lightly-structured text records of what happened during a run. They include the timestamps, the step boundaries, the model calls, the corpus queries, the editor decisions, and any errors. Logs answer the question "what was the immediate cause of this specific failure?" They are the cheapest layer to produce and the most familiar.
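For concreteness, a run log at this layer might look like the excerpt below. The format is illustrative, not the runtime's actual log schema; the timestamps echo the trace span shown later in this post.

```
2026-04-15T03:42:12.771Z INFO  wf-014 step=draft  model_call start
2026-04-15T03:42:14.018Z INFO  wf-014 step=draft  model_call ok latency_ms=1247
2026-04-15T03:42:14.044Z WARN  wf-014 step=review editor_queue depth=3
```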

Logs are not enough on their own. They cannot answer "why did the model produce this output" because the model is a function of inputs the log does not capture in full — the corpus state, the prompt serialization, the previous editorial decisions in scope, the policy applied. A log can show the input that was sent and the output that came back. It cannot reconstruct the decision space.

Layer 2: Traces.

Traces are structured records of a decision pipeline. Every node in the workflow graph contributes a trace span. Every model call captures its input schema, its output schema, the model version, and the corpus version pin in scope at the time. Every retrieval captures the query, the schema constraints, the policy applied, the candidate set returned, and the final cited result. Every editor action captures the decision, the editor identity, and the rubric applied.

A trace is replayable. Pull the trace, lock the corpus to the captured corpus version, lock the model to the captured model version, and re-run the workflow. The output should be identical (or, for stochastic models, statistically indistinguishable). Replayability is not just a debugging convenience. It is what makes the trace a source of truth for the decision that was made.
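Replay can be sketched as a pure function of the pinned versions. Everything here — `TraceSpan`, the `run` callback, the hash helper — is a hypothetical stand-in for the runtime's real trace schema and workflow executor, not an actual API:

```typescript
// Minimal replay sketch. TraceSpan and the `run` callback are hypothetical
// stand-ins for the runtime's real trace schema and workflow executor.
import { createHash } from "node:crypto";

interface TraceSpan {
  step: string;
  modelVersion: string;
  corpusVersion: string;
  input: unknown;
  outputHash: string; // sha256 of the serialized output, captured at run time
}

const sha256 = (value: unknown): string =>
  "sha256:" + createHash("sha256").update(JSON.stringify(value)).digest("hex");

// Re-run the step with the pinned model and corpus versions, then compare
// output hashes. For stochastic models this exact-match check would be
// replaced by a statistical comparison over repeated runs.
function replayMatches(
  span: TraceSpan,
  run: (step: string, modelVersion: string, corpusVersion: string, input: unknown) => unknown,
): boolean {
  const output = run(span.step, span.modelVersion, span.corpusVersion, span.input);
  return sha256(output) === span.outputHash;
}
```

The exact-match check is the important design choice: a replay that drifts under identical pins means the trace missed an input, and that missing input is itself the bug.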

Traces answer "why did this output happen" in a way logs cannot. They are also what eval suites run against — the production-seeded eval pattern we wrote about depends on traces being durable, structured, and replayable.

// A trace span captured at the model-call layer
{
  spanId: "spn_a8f...",
  traceId: "tr_19b...",
  workflowId: "wf-014",
  step: "draft",
  modelVersion: "tenant.brand-voice.v3.2",
  corpusVersion: "4.2.1",
  inputSchema: "RenewalBriefInput",
  outputSchema: "RenewalBriefOutput",
  policyVersion: "2026-Q1.v4",
  inputHash: "sha256:7f2a...",
  outputHash: "sha256:c9d1...",
  latencyMs: 1247,
  costUsd: 0.043,
  cited: ["doc:brief.q3-2025.v7", "doc:transcript.acme.2025-11-08.v1"],
  ts: "2026-04-15T03:42:14.018Z"
}

Layer 3: Audit trails.

Audit trails are the regulatory-grade summary of a decision. They take the trace and add the things a regulator or auditor cares about: which version of which policy was in effect, which editor signed off on the output, what the editor's rubric was, what the system would have done in the absence of the editor, and what data classifications were involved at every step. Audit trails are signed and timestamped at write time, and the signature is verifiable later.
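As a purely illustrative sketch, an audit-trail entry derived from a trace span might carry fields like these. Every name and value here is a hypothetical example, not the runtime's actual schema:

```typescript
// Illustrative audit-trail entry. All field names and values are
// hypothetical examples, not an actual production schema.
const auditEntry = {
  traceId: "tr_19b...",                  // links back to the replayable trace
  policyVersion: "2026-Q1.v4",           // policy in effect at decision time
  dataClassifications: ["customer-contract", "usage-telemetry"],
  editor: {
    id: "ed_042",
    rubricVersion: "renewal-brief.v2",
    decision: "approved",
  },
  counterfactual: "auto-send",           // what the system would have done without the editor
  signedAt: "2026-04-15T03:42:15.101Z",  // signed and timestamped at write time
  signature: "hmac-sha256:...",          // verifiable later
};
```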

An audit trail is not a debugging artifact. It is an external-review artifact. The audience is not the on-call engineer. It is the security reviewer, the internal risk committee, or the customer asking why the system produced a specific output that affected them. The audit trail is the answer.

The audit-trail layer is the one most agent runtimes do not ship with. It is also the one that becomes load-bearing the moment the deployment touches regulated data, contracted enterprise customers, or any decision flow that a board would want to inspect.

Three incident classes this architecture has paid for itself on.

The hallucination investigation. A customer reports that the agent produced an output containing a claim that is not true. With logs alone, the investigation is forensic guesswork. With traces, the investigation is straightforward: pull the trace, see which corpus chunks were cited, and evaluate whether the chunks support the claim or the model fabricated it. We wrote about a specific instance where this turned a multi-day investigation into a forty-minute one.

The review question. A reviewer asks how the system decided not to flag a particular case. With logs, the answer is "we don't have that detail." With audit trails, the answer is "the policy version in effect at that time required these criteria, the input did not satisfy criterion three, the editor on duty reviewed and concurred. Here is the signed record."

The model-promotion regression. A new model candidate is being evaluated. The eval suite reports a regression in one rubric. With traces, the engineer can pull the specific failing cases, see exactly what the new model did differently from the old model on the same input, and root-cause the regression to a specific behavioral change. Without traces, the engineer is reduced to running prompts manually and trying to reproduce.

What "observable" actually requires.

The architecture above implies four operational requirements that most agent runtimes do not satisfy by default.

Trace storage with retention horizons matched to review cycles. Traces need to live as long as the longest review cycle that might query them. For annual enterprise reviews, that is roughly a year. For region-restricted workflows, longer. Traces are large; storage is the dominant operational cost of this architecture.
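As a sketch, a retention policy keyed to those horizons might look like this. The numbers are illustrative defaults, not recommendations; the only real rule is that each layer outlives the longest review cycle that might query it:

```typescript
// Illustrative retention horizons, in days. The exact numbers depend on the
// longest review cycle that might query each layer.
const retentionDays = {
  logs: 30,              // immediate-failure debugging only
  traces: 400,           // covers an annual enterprise review cycle with margin
  auditTrails: 7 * 365,  // regulated and region-restricted workflows keep these longest
};

// Oldest timestamp a query against a given layer can still reach.
function retentionFloor(layer: keyof typeof retentionDays, now: Date): Date {
  return new Date(now.getTime() - retentionDays[layer] * 24 * 60 * 60 * 1000);
}
```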

Replayability with corpus and model version pinning. Replay requires that the corpus and the model version captured in the trace are still retrievable. Corpus versioning we covered separately. Model versioning means keeping old model weights addressable, which is a real ops investment.

Signed audit-trail writes with verifiable signatures. The audit trail's value depends on its integrity. We sign every audit-trail entry at write time with a tenant-scoped key and verify on read. Without signing, the audit trail is a log with extra fields.
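One minimal way to get tamper-evident writes is an HMAC over the serialized entry with a tenant-scoped key. This is a sketch of the idea, not the actual signing scheme described above; a production system would likely prefer asymmetric signatures so readers can verify without holding the signing key, and would use a canonical serialization (stable key order) rather than plain JSON.stringify:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Tamper-evidence sketch: HMAC over the serialized entry with a
// tenant-scoped key. Illustrative only; real systems should canonicalize
// the serialization and consider asymmetric signatures instead.
function signEntry(entry: object, tenantKey: Buffer): string {
  return createHmac("sha256", tenantKey).update(JSON.stringify(entry)).digest("hex");
}

// Verify on read. Any post-write mutation of the entry changes the MAC.
function verifyEntry(entry: object, signature: string, tenantKey: Buffer): boolean {
  const expected = Buffer.from(signEntry(entry, tenantKey), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```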

A surface that operators actually use. The trace and audit data are useful only if the operator team can query them at the moment of investigation, not three days later. We surface trace and audit data in the Knyte automation page inside the operator's regular workflow review surface, so the latency between incident and answer is the latency of clicking through, not the latency of getting an engineer to query the trace store manually.

Where to start if you have logs and nothing else.

Adding traces and audit trails to a deployment that started with logs is incremental. Start at the model-call layer: every model call captures input schema, output schema, model version, corpus version, and citations. That alone takes you most of the way on hallucination investigation. The retrieval layer comes next. The policy layer last.
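The model-call layer can be as small as a wrapper that emits a span as a side effect of every call. All names here (`ModelCallContext`, `traced`, `emit`) are illustrative, assuming an async model client:

```typescript
import { createHash, randomUUID } from "node:crypto";

// Hypothetical model-call wrapper: every call emits a trace span as a side
// effect. Names and shapes are illustrative, not a real runtime API.
interface ModelCallContext {
  workflowId: string;
  step: string;
  modelVersion: string;
  corpusVersion: string;
}

interface Span extends ModelCallContext {
  spanId: string;
  inputHash: string;
  outputHash: string;
  latencyMs: number;
  ts: string;
}

const sha256 = (v: unknown): string =>
  "sha256:" + createHash("sha256").update(JSON.stringify(v)).digest("hex");

async function traced<I, O>(
  ctx: ModelCallContext,
  input: I,
  call: (input: I) => Promise<O>,
  emit: (span: Span) => void,
): Promise<O> {
  const start = Date.now();
  const output = await call(input); // the underlying model call
  emit({
    ...ctx,
    spanId: "spn_" + randomUUID(),
    inputHash: sha256(input),
    outputHash: sha256(output),
    latencyMs: Date.now() - start,
    ts: new Date().toISOString(),
  });
  return output;
}
```

Because the wrapper only reads the call's input and output, it can be threaded through an existing codebase one call site at a time, which is what makes this layer retrofittable when logs are all you have.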

The audit-trail layer is built once the trace layer is mature. The audit-trail writes are derived from the trace data — the schema is similar, with additional fields for policy version, editor identity, and signature. We typically ship the trace layer in week two of an install and the audit-trail layer in week six, after the policy and editor surfaces are stable.

Observability is the architectural difference between a deployment you can debug and a deployment you can defend. The investment is real. The payback is the day a regulator or a customer asks the question that logs cannot answer, and you have an answer that holds up.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
