
Multi-tenant fine-tunes without cross-tenant leakage.

If your AI deployment serves multiple tenants, the fine-tuning architecture either keeps their data separate or it does not. The intermediate cases are not safe. Here is what separation actually requires.

By D. Cho · STAFF ENGINEER · KNYTE
PUBLISHED APRIL 20, 2026
READ TIME 13 MIN
CATEGORY ENGINEERING

If your AI deployment serves multiple tenants — distinct customers, business units, or jurisdictions — the fine-tuning architecture has to make a binary choice: weights are tenant-scoped, or weights are shared with a logical partitioning layer on top. The two architectures look similar in a diagram. They have materially different security postures, audit profiles, and failure modes. The intermediate cases — "shared weights with strong access controls" — are not safe in any sense the security team would defend in a regulatory review.

What follows is the architecture we run for multi-tenant fine-tuning, the leakage modes we have seen in shared-weight architectures, and the specific properties that distinguish a tenant-scoped fine-tune from a shared-weight deployment that is pretending to be tenant-scoped.

Why shared weights leak.

A shared-weight fine-tune trains a single model on data from multiple tenants. The model's weights, post-fine-tune, contain a representation of every tenant's data that contributed to the training. The representation is statistical and distributed; it cannot be "removed" for a specific tenant without retraining. There are three leakage modes.

Direct generation leakage. A tenant's prompt can elicit content that derives from another tenant's training data. The most common form is verbatim or near-verbatim regurgitation of training examples; a more subtle form is statistical disclosure where the model's outputs reveal patterns specific to another tenant's corpus. Both are documented in the literature; both occur in production.
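
To make the first mode concrete, here is the shape of a regurgitation check. This is a minimal sketch, assuming a per-tenant index of training-corpus n-grams; the function name and the eight-token window are ours, not a calibrated detector.

// Sketch: near-verbatim regurgitation check (illustrative only).
// Assumes a per-tenant index of training-corpus n-grams; the
// eight-token window is arbitrary, not a calibrated detector.
function sharesTrainingNGram(
  output: string,
  corpusNGrams: Set<string>,
  n = 8,
): boolean {
  const tokens = output.split(/\s+/).filter(Boolean);
  for (let i = 0; i + n <= tokens.length; i++) {
    if (corpusNGrams.has(tokens.slice(i, i + n).join(" "))) return true;
  }
  return false;
}

Note what the sketch is: detection, not prevention. In a shared-weight architecture, heuristics like this are the only mitigation available, which is the core problem.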

Membership inference. A tenant can probe the model to determine whether specific examples were in the training set. This is a privacy violation under most regulatory frameworks even when no content is regurgitated, because membership status itself is sensitive information about the contributing tenant.
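
A probe for this does not need to be sophisticated. A minimal sketch, assuming the serving API exposes a per-example score such as mean token log-probability; the threshold is hypothetical and would be calibrated against known non-member examples in practice.

// Sketch: a naive membership-inference probe (illustrative only).
// Assumes the serving API exposes a per-example score such as mean
// token log-probability; the threshold is hypothetical.
async function looksLikeTrainingMember(
  score: (text: string) => Promise<number>,  // mean token log-prob
  candidate: string,
  threshold: number,
): Promise<boolean> {
  // Training-set members tend to score anomalously well.
  return (await score(candidate)) > threshold;
}

The probe is trivial on purpose. Nothing architectural stops a tenant from running it against weights that were trained on other tenants' data.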

Prompt-injection-as-extraction. A tenant's prompt can be crafted to instruct the model to disclose its training data. Modern models have some resistance to this, but the resistance is heuristic, not architectural. A determined attacker has options.

What tenant-scoped weights actually require.

The architecture we run on multi-tenant Knyte deployments has four properties. Each is necessary; the absence of any one of them collapses the architecture back to shared-weight.

Per-tenant model weights. Each tenant has its own fine-tune of the base model. The weights are stored in tenant-scoped storage, loaded into tenant-scoped inference instances, and never reside in any shared inference path. This is the architectural primitive; everything else is implementation detail.
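
A sketch of the registry behind this property, with assumed names chosen to mirror the routing code later in this piece. The point is the path scheme: the tenant identity is the only input, so there is no way to address another tenant's weights.

// Sketch: a tenant-scoped model registry (names are illustrative).
type TenantId = string;

interface TenantModel {
  version: string;
  generate(request: unknown, opts: { audit: object }): Promise<unknown>;
}

interface WeightStore {
  load(path: string): Promise<TenantModel | null>;
}

class TenantModelRegistry {
  constructor(private store: WeightStore) {}

  loadFor(tenantId: TenantId): Promise<TenantModel | null> {
    // Tenant-scoped key; the tenant identity is the only input,
    // so no call can address another tenant's weights.
    return this.store.load(`tenants/${tenantId}/model/current`);
  }
}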

Per-tenant training pipelines. The fine-tune cycle for one tenant runs against that tenant's training data only. The pipeline is tenant-scoped at the data ingest, the training compute, and the artifact write. There is no point in the pipeline where another tenant's data could be touched, even accidentally, because the pipeline does not have a path to it.
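
The same property, sketched as configuration. The names are illustrative; what matters is that every path the pipeline can touch is derived from a single tenant identity.

// Sketch: deriving every pipeline path from the tenant identity
// (names are illustrative).
function trainingConfigFor(tenantId: string) {
  return {
    ingest: { dataPath: `tenants/${tenantId}/training-data/` },
    training: { computePool: `ft-${tenantId}`, baseModel: "base/current" },
    artifact: { writePath: `tenants/${tenantId}/model/candidate` },
  };
}

The thing worth noticing is the function's shape: it takes one argument. Isolation is a consequence of the signature, not of a check inside it.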

Per-tenant inference routing. A tenant's request routes only to inference instances loaded with that tenant's weights. The router enforces this at the request layer, not at the response layer. There is no fallback to a shared model. If the tenant's instances are unhealthy, the request fails; it does not silently use a different model.

Per-tenant audit trail. Every inference call against a tenant's model is logged with the tenant identity, the model version pin, and the request hash. The audit log is itself tenant-scoped, with no cross-tenant aggregation that could leak request patterns. The audit log is what defends the architecture during a security review. One plausible record shape is sketched after the routing code below.

// Multi-tenant inference routing
async function infer(tenantId: TenantId, request: Request) {
  const model = await modelRegistry.loadFor(tenantId);  // tenant-scoped
  if (!model) {
    // No fallback. Request fails rather than
    // routing to a shared or different tenant's model.
    throw new TenantModelUnavailable(tenantId);
  }
  const result = await model.generate(request, {
    audit: { tenantId, modelVersion: model.version },
  });
  return result;
}
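
The audit options in the routing code above carry the tenant identity and the model version pin. Here is one plausible shape for the record behind them, with assumed field names; the write target would be a per-tenant log stream, never a shared one.

// Sketch: a tenant-scoped audit record (field names are illustrative).
import { createHash } from "node:crypto";

interface AuditRecord {
  tenantId: string;
  modelVersion: string;
  requestHash: string;  // hash of the request, never the body itself
  timestamp: string;
}

function auditRecord(
  tenantId: string,
  modelVersion: string,
  requestBody: string,
): AuditRecord {
  return {
    tenantId,
    modelVersion,
    requestHash: createHash("sha256").update(requestBody).digest("hex"),
    timestamp: new Date().toISOString(),
  };
}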

Why "shared weights with access controls" is not the same.

The most common alternative architecture is shared model weights with access controls at the request layer — the same model serves every tenant, but the request authorization decides which tenant's corpus the model retrieves from. This architecture is operationally simpler and structurally unsafe.

The reason it is unsafe is that the model's weights still encode information from every tenant whose data contributed to the training. The access control is enforced at retrieval time, but the model itself was trained on data from all tenants, and that training is permanent. A tenant whose access control prevents them from retrieving another tenant's documents can still elicit information about that tenant from the model's parameters. The access control was never the architectural primitive; the weights were.
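
To make the contrast concrete, here is the shared-weight pattern in the same shape as the routing code above. The names are stand-ins; the comments mark where the access control lives and where it does not.

// Sketch: the shared-weight pattern, for contrast (illustrative only).
// retrieval and sharedModel are stand-ins for a per-tenant document
// index and a single model trained on every tenant's data.
declare const retrieval: {
  searchFor(tenantId: string, query: string): Promise<string[]>;
};
declare const sharedModel: {
  generate(query: string, opts: { context: string[] }): Promise<string>;
};

async function inferShared(tenantId: string, query: string) {
  // Access control lives here, at retrieval time...
  const docs = await retrieval.searchFor(tenantId, query);
  // ...but the weights below encode every tenant's training data.
  // Nothing at this call site can scope what the model has memorized.
  return sharedModel.generate(query, { context: docs });
}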

Teams running this architecture often discover the issue during a security review, when the auditor asks the question: "If Tenant A's prompt could in principle elicit Tenant B's data from your model, what is the architectural mitigation?" The honest answer is "no architectural mitigation, only heuristic mitigations." Most enterprise security postures require an architectural mitigation, which the shared-weight architecture cannot provide.

Operational cost of tenant-scoped weights.

The cost is real. Per-tenant fine-tunes mean N training runs instead of one. Per-tenant inference instances mean a more complex routing layer and higher resource usage at low tenant counts. Per-tenant audit trails mean more storage. The cost is not negligible, particularly for deployments serving many small tenants.

Two things make the cost manageable in practice. The training cost amortizes well at moderate tenant scale because tenants share the base model and only the fine-tune layer is per-tenant. The inference cost amortizes because tenants typically have asymmetric request patterns, so instance pooling can serve them from a smaller pool than naive per-tenant provisioning would suggest.
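
A sketch of the pooling idea, under the assumption that an instance serves one tenant at a time and swaps weights at lease boundaries; the class and method names are ours.

// Sketch: instance pooling with tenant-scoped weight loading
// (class and method names are illustrative).
interface Instance {
  loadWeights(path: string): Promise<void>;
  unloadWeights(): Promise<void>;
}

class InstancePool {
  constructor(private idle: Instance[]) {}

  async leaseFor(tenantId: string): Promise<Instance> {
    const instance = this.idle.pop();
    // No fallback to a warm instance holding another tenant's weights.
    if (!instance) throw new Error(`no pooled capacity for ${tenantId}`);
    await instance.loadWeights(`tenants/${tenantId}/model/current`);
    return instance;
  }

  async release(instance: Instance): Promise<void> {
    await instance.unloadWeights();  // weights never outlive the lease
    this.idle.push(instance);
  }
}

Pooling changes when weights are loaded, not whose weights an instance may hold at once, so the tenant-scoping property survives.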

We wrote about tenant-owned weights at the architectural-decision level. This dispatch is the engineering implementation of the same decision. The architectural choice is binary; the engineering implementation has degrees of operational efficiency, but no degree of "shared weights with strong controls" that is equivalent to tenant-scoped weights for security purposes.

If your multi-tenant AI deployment uses a shared-weight architecture today, the right framing is not whether the architecture has produced an incident yet. It is whether you can defend the architecture in front of a security review that asks the architectural-mitigation question. If the answer is no, the migration to tenant-scoped weights is a bounded engineering project. The cost of doing it now is meaningfully smaller than the cost of doing it under regulatory pressure later.

D. Cho · STAFF ENGINEER · KNYTE

Built distributed retrieval at Pinterest and Databricks. Spends most days inside trace viewers and the rest writing about why your eval suite lies to you.
