
Designing queryable memory for enterprise retrieval pipelines.

Most retrieval pipelines treat memory as a vector store. That decision is the source of most of their downstream pain. Here is how we architect queryable memory at the corpus, schema, and access-policy layers.

By J. Reichert · PRINCIPAL ENGINEER · KNYTE
PUBLISHED APRIL 15, 2026
READ TIME 18 MIN
CATEGORY ENGINEERING

When a team says "we built RAG," what they usually mean is "we have a vector store and a retrieval call before the model runs." That definition is fine for a prototype. It is not enough to run an enterprise deployment that survives a year of corpus drift, schema changes, and editor-in-the-loop corrections. The vector store is one layer in what should be a four-layer architecture, and treating it as the entire architecture is the source of most of the pain enterprise teams experience around month four.

Queryable memory, as we use the term, is a corpus that survives the vendor cycle, the schema cycle, and the model cycle. It is governed by an access policy the buyer owns. It is observable in production. And it is structured so that an editor's correction in month nine is materially more useful to the model than the same correction would have been in month one. The vector store is one of four components that produce that property. The other three are where most teams accumulate technical debt.

The four layers, and what each one owns.

We architect queryable memory at four distinct layers. Each one has a different lifecycle, a different access pattern, and a different ownership model. Mixing them produces the failure modes we keep being called in to remediate.

Layer 1: The corpus.

The corpus is the source of truth. Every artifact the deployment retrieves over — every brief, every transcript, every product spec, every customer email — lives in the corpus first, in its original format, with the metadata necessary to reconstruct the original document on demand. The corpus is versioned. Every version is addressable. Nothing is ever deleted without a tombstone that explains why.
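
As a sketch, a corpus record that satisfies those constraints might carry the following fields. The names here (CorpusRecord, the tombstone shape) are illustrative, not a fixed contract:

// Illustrative corpus record. Every version is addressable by
// (docId, version); a deletion leaves a tombstone, never a gap.
interface CorpusRecord {
  docId: string;                 // stable identity across versions
  version: number;               // monotonic, never reused
  contentRef: string;            // pointer to the original-format artifact
  mimeType: string;              // enough to reconstruct the original on demand
  lineage: { sourceSystem: string; sourceId: string };
  createdAt: string;             // ISO 8601
  tombstone?: { reason: string; actor: string; deletedAt: string };
}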

The corpus is the asset that compounds. The vector store does not. The retrieval pipeline does not. Models come and go; the corpus is what survives. If your queryable memory is architected so that a model swap requires a corpus rebuild, you do not have a corpus — you have a fine-tuning input that you are going to lose the next time the underlying model changes.

Layer 2: The schema.

The schema is the structured representation of what is in the corpus. It includes the document type, the editorial status, the lineage from source systems, the access classification, and the relationships between artifacts. The schema is not the corpus. It is the layer that makes the corpus queryable in ways the embedding layer alone cannot.
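
A sketch of what a schema entry might look like for the corpus record above; the field names are ours, and real installs will differ:

// Illustrative schema entry: the structured, queryable view of one
// corpus document, held in an index that supports filter pushdown.
interface SchemaEntry {
  docId: string;
  docType: "campaign-brief" | "transcript" | "product-spec" | "customer-email";
  status: "draft" | "in-review" | "approved" | "retired";
  classification: "public" | "internal" | "restricted";
  lineage: { sourceSystem: string; sourceId: string };
  relatedDocIds: string[];       // relationships between artifacts
  quarter?: string;              // e.g. "2025-Q3", for structured date filters
}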

Most teams skip this layer or fold it into the vector store as metadata fields. The result is that any query that requires structured filtering — "find all approved campaign briefs from Q3" — has to be implemented as a post-retrieval filter against the vector results, which is both slow and lossy. A real schema layer lets the retriever push the structured constraints down to the corpus index, which is the difference between a 200ms query and a 2s query.

Layer 3: The embedding layer.

The embedding layer is where the vector store actually lives. Its job is to convert chunks of corpus content into vectors, store them with their references back to the corpus, and answer nearest-neighbor queries. It is the component most teams think of as the entire retrieval system. It is the most replaceable layer in the stack.

The embedding layer should be considered ephemeral. If the embedding model changes, the entire embedding layer can be rebuilt from the corpus in a defined window. We test this in production by triggering a full re-embed against a current candidate model on a quarterly cadence. The re-embed runs in the background, swaps in atomically, and the corpus does not move. The teams that conflated the embedding layer with the corpus discover, when they try this, that they cannot.
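
A sketch of that rebuild, under the assumption of a blue/green index pair; indexStore, chunkDocument, and the iterate call are hypothetical helpers, not a published API:

// The corpus is the sole input; the live alias flips to the new
// index only after the rebuild completes.
async function reembed(corpus: Corpus, model: EmbeddingModel): Promise<void> {
  const staging = await indexStore.createIndex(`emb-${model.id}`);
  for await (const doc of corpus.iterate({ includeTombstoned: false })) {
    for (const chunk of chunkDocument(doc)) {
      const vector = await model.embed(chunk.text);
      await staging.upsert({ docId: doc.docId, chunkId: chunk.id, vector });
    }
  }
  // Atomic cutover: readers never see a half-built index, and the
  // old index stays available until the new one is verified.
  await indexStore.swapAlias("embeddings-live", staging.name);
}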

Layer 4: The access policy.

The access policy defines who can retrieve what. It is enforced at the corpus layer, not at the embedding layer. A query from an editor with broad permissions sees a different result set than the same query from a workflow agent with narrow permissions. The policy is auditable: every retrieval logs the policy version that governed it, the requester identity, and the corpus subset that was visible.
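
A sketch of the audit record that paragraph implies; the shape is ours, but each field answers a question a reviewer will eventually ask:

// Illustrative audit record, written once per retrieval. The policy
// version plus the visible-subset fingerprint make "who saw what,
// when" answerable after the fact.
interface RetrievalAudit {
  retrievalId: string;
  actor: { id: string; kind: "editor" | "workflow-agent"; roles: string[] };
  policyVersion: string;         // the exact policy that governed this query
  visibleSubsetHash: string;     // fingerprint of the corpus subset in scope
  queryText: string;
  timestamp: string;             // ISO 8601
}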

This layer is the one most often missing in teams that built RAG quickly and are now in the middle of a buyer security review. The reviewer wants to know who saw what data, when. If the access policy was enforced inside the embedding query — "we filter the vector results by user role" — the answer is incomplete: the nearest-neighbor step still ran over vectors the requester should never have touched, and revealing which ones were nearest is a leak on its own. Push the policy down to the corpus layer.

// Corpus query: structured filters pushed to the index, embedding
// nearest-neighbor scoped to the policy-visible subset only.
const result = await corpus.query({
  schema: { docType: "campaign-brief", status: "approved" },
  semantic: { text: queryText, k: 12 },
  policy: { actor, intent: "draft.generate" },
});

Why the four-layer split matters for compounding.

Compounding output, the metric we wrote about in a separate dispatch, depends on the rate at which editor corrections improve future generations. The four-layer split is what makes a correction useful in month nine instead of useful only to the document being edited.

When an editor marks a generated draft as approved with a specific edit, the system can update the corpus (a new approved exemplar exists), update the schema (the editorial status of related drafts may have shifted), trigger a re-embed of the affected chunks, and revise the access policy if the approval changed the document classification. All four updates happen because the layers are separable. In a single-layer vector-store architecture, only the embedding gets updated, and only for the specific chunk that was edited. The rest of the system does not learn.
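
A sketch of that fan-out, with illustrative handles for the four layers' write paths (corpus, schema, embeddings, and policy are our names, not a product API):

interface ApprovalEvent {
  docId: string;
  approvedContent: Uint8Array;   // the edited, approved draft
  changedChunkIds: string[];
  newClassification?: "public" | "internal" | "restricted";
}

// One editorial event, four separable updates.
async function onEditorApproval(event: ApprovalEvent): Promise<void> {
  // 1. Corpus: the approval becomes a new addressable exemplar.
  const version = await corpus.putVersion(event.docId, event.approvedContent);

  // 2. Schema: editorial status shifts; related drafts may shift with it.
  await schema.update(event.docId, { status: "approved" });

  // 3. Embedding layer: re-embed only the chunks the edit touched.
  await embeddings.reembedChunks(event.docId, version, event.changedChunkIds);

  // 4. Access policy: approval may have changed the classification.
  if (event.newClassification) {
    await policy.reclassify(event.docId, event.newClassification);
  }
}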

The patterns we have stopped recommending.

After a few hundred deployment audits, three retrieval patterns have proven structurally fragile and we have stopped recommending them.

Hosted vector stores as primary memory. A hosted vector store is a perfectly fine embedding layer. It is a terrible primary corpus. The data lives in a vendor's environment, the schema flexibility is constrained by the vendor's product, and the embedding model is tied to the vendor's deployment cycle. Use the hosted store as one possible embedding backend; keep the corpus elsewhere.

Single-vector-per-document indexing. This is the pattern of producing one embedding per document and querying that document set directly. It works for very short documents and breaks for everything else. Chunk-level embedding with a parent-document join produces materially better retrieval quality, at the cost of slightly higher storage. The trade-off is overwhelmingly favorable for any corpus over a few thousand documents.
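
A sketch of the chunk-level pattern with the parent join; model, chunkIndex, and corpus are hypothetical handles. Retrieval ranks chunks, then collapses hits to their parent documents before anything reaches the model:

// Over-fetch at the chunk level, then keep each parent's best chunk.
async function retrieveWithParentJoin(queryText: string, k: number) {
  const queryVec = await model.embed(queryText);
  const hits = await chunkIndex.nearest(queryVec, k * 4);

  const bestPerParent = new Map<string, number>();
  for (const hit of hits) {
    const prev = bestPerParent.get(hit.parentDocId) ?? -Infinity;
    if (hit.score > prev) bestPerParent.set(hit.parentDocId, hit.score);
  }

  const topParents = [...bestPerParent.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([docId]) => docId);
  return Promise.all(topParents.map((docId) => corpus.get(docId)));
}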

Pure semantic retrieval with no lexical fallback. Semantic similarity is excellent for fuzzy concept matching and bad at exact-match retrieval. "Find the Q3 2025 retention deck" is a lexical query, not a semantic one. Hybrid retrieval — BM25 fused with semantic similarity, scored together — is the production-grade pattern. We have a separate dispatch on hybrid retrieval architecture that goes deeper.
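
One standard way to score the two rankings together is reciprocal rank fusion, which works on ranks alone so the BM25 and cosine score scales never have to be reconciled. A minimal sketch:

// Reciprocal rank fusion over two ranked lists of document ids.
// k = 60 is the conventional damping constant from the RRF paper.
function fuseRankings(bm25Ids: string[], semanticIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ids of [bm25Ids, semanticIds]) {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}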

Observability, the layer everyone forgets.

Queryable memory has to be observable. Every retrieval produces a trace: the query text, the schema constraints, the policy applied, the candidate set returned by the embedding layer, the post-retrieval re-rank scores, and the final result that was passed to the model. The trace is durable. The trace is replayable. An editor who wants to know why a generation cited a particular document can pull the trace and see the path.
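
The trace enumerated above, as a sketch; field names are illustrative:

// One durable, replayable record per retrieval, spanning all four
// layers plus the post-retrieval re-rank.
interface RetrievalTrace {
  retrievalId: string;
  queryText: string;
  schemaConstraints: Record<string, string>;
  policyVersion: string;
  embeddingCandidates: { chunkId: string; score: number }[];
  rerankScores: { docId: string; score: number }[];
  finalDocIds: string[];         // exactly what was passed to the model
  timestamp: string;
}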

This is not optional. It is the difference between a deployment that can be debugged and a deployment that produces inexplicable outputs that the editor team has to dismiss case by case. We instrument every retrieval at all four layers and surface the traces in the Knyte automation surface so the operator team has them at the moment of investigation, not three days later.

Where to start if you are mid-deployment.

If you have a deployment running on a single-layer vector-store architecture and the four-layer split is the destination, the migration is not a rewrite. It is an extraction. The corpus is the first thing to extract — pull every document indexed in the vector store back into a corpus you control, with original formats and lineage metadata. The schema and access policy come next. The embedding layer is what gets rebuilt last, against the now-extracted corpus.
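
A sketch of that extraction pass, with hypothetical handles (legacyStore, fetchOriginal); the old store is read-only input here, never the system of record:

// Runs alongside the live pipeline; nothing cuts over yet.
async function extractCorpus(legacyStore: LegacyVectorStore, corpus: Corpus) {
  for await (const record of legacyStore.scan()) {
    // Prefer the original artifact from the source system; fall back
    // to the chunk text the old store kept, so nothing is lost outright.
    const original = (await fetchOriginal(record.sourceRef)) ?? record.text;
    await corpus.putVersion(record.docId, original, {
      lineage: { sourceSystem: record.sourceSystem, sourceId: record.sourceId },
    });
  }
}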

We have run this migration in two-week sprints for installs that were already in production. The trick is to keep the existing retrieval pipeline running while the new corpus is being populated, then switch the embedding layer to point at the new corpus, then deprecate the old vector store. The deployment never stops. The compounding curve, in our measurements, bends upward within four to six weeks of the migration as the new corpus starts capturing the editorial signal that the single-layer architecture had been losing.

Queryable memory is the asset that determines whether a deployment compounds. The vector store is the most replaceable component. Architect them separately. The architecture pays back inside a quarter, and it survives every model swap that comes after.

J. Reichert · PRINCIPAL ENGINEER · KNYTE

Twelve years on production retrieval and inference systems. Previously at Stripe (risk infra) and Anthropic (eval tooling). Writes about the boring parts of agentic infra.
