Production retrieval is hybrid. Every meaningful corpus query in an enterprise deployment combines two signals: semantic similarity ("find chunks whose meaning is close to the query") and lexical match ("find chunks that contain these specific tokens"). Either signal alone produces a recognizable failure mode. Pure semantic search misses exact-match queries — the user asking for the Q3 retention deck does not want the conceptually-related Q2 deck. Pure lexical search misses paraphrases — the user asking about "churn" gets nothing back if the documents say "customer attrition".
In most production systems we audit, what the team calls pure semantic retrieval is not actually pure semantic retrieval: it is semantic retrieval with a quietly degrading user experience that the team has been routing around with manual fixes. Production teams eventually conclude that they need a lexical signal. The question is whether they reach that conclusion in week six or in month nine.
What follows is the hybrid retrieval architecture we run in production, including the rank-fusion algorithm we have settled on after testing several, the per-workflow tuning surface, and the production failure modes we have eliminated.
The two signals, and what each one is good at.
Semantic similarity is excellent at concept matching. The query "what did we say about pricing in Q3" returns chunks discussing pricing strategy in Q3 even if they do not contain the literal word "pricing" — they may use "unit economics" or "monetization" or "ARPU." Semantic similarity is also good at handling paraphrase, multilingual content, and domain-specific synonymy.
Semantic similarity is bad at exact-match. The query "WF-014" returns chunks that are conceptually similar to workflow identifiers, which is the wrong answer. The query "Tuaha Jawaid" returns chunks about engineering leadership, which is also the wrong answer. Anything that requires the literal token to be present — IDs, names, code references, error messages — is structurally outside the strength of semantic similarity.
Lexical matching, typically BM25 in production, is exactly the opposite. It excels at exact-match queries and falls apart on paraphrase. "Churn" returns nothing when the documents say "attrition." "Decrease pricing" returns the chunks discussing increases too, because both contain "pricing." Lexical matching is fast, deterministic, and cheap, but its recall ceiling on conceptually-phrased queries is low.
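To make the lexical side concrete, here is a minimal BM25 scorer over a toy in-memory corpus. This is an illustrative sketch, not the production implementation; the tokenization is naive and k1 = 1.2, b = 0.75 are the commonly used defaults, not values from this deployment.

```typescript
type Doc = { id: string; tokens: string[] };

function bm25Score(query: string[], doc: Doc, corpus: Doc[], k1 = 1.2, b = 0.75): number {
  const N = corpus.length;
  const avgdl = corpus.reduce((s, d) => s + d.tokens.length, 0) / N;
  let score = 0;
  for (const term of query) {
    // Document frequency: how many docs contain the term at all.
    const df = corpus.filter((d) => d.tokens.includes(term)).length;
    if (df === 0) continue; // term absent from the corpus contributes nothing
    // Smoothed IDF, as in the standard BM25 formulation.
    const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
    const tf = doc.tokens.filter((t) => t === term).length;
    // Term-frequency saturation with length normalization.
    score += (idf * (tf * (k1 + 1))) / (tf + k1 * (1 - b + b * (doc.tokens.length / avgdl)));
  }
  return score;
}

const corpus: Doc[] = [
  { id: "a", tokens: ["customer", "attrition", "rose", "in", "q3"] },
  { id: "b", tokens: ["pricing", "strategy", "for", "q3"] },
];

// The paraphrase failure mode described above: "churn" has no lexical
// overlap with a document that says "attrition", so the score is zero.
bm25Score(["churn"], corpus[0], corpus); // → 0
```

Note that the exact-match behavior is the mirror image: a query containing the literal token "attrition" scores the first document positively while semantic paraphrases of it score nothing.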
Hybrid retrieval combines both signals so that the query gets the strengths of each. The interesting design decisions are how the two signals are scored, how they are fused, and how the fusion is tuned per workflow.
Reciprocal rank fusion as the production default.
We tried several fusion algorithms — score normalization with weighted averaging, learned-to-rank reranking, ensemble voting — and have settled on reciprocal rank fusion (RRF) as the production default. The algorithm is embarrassingly simple: each document gets a score equal to the sum of 1/(k + rank) across all retrieval methods, where k is a small constant (we use 60). Documents that rank highly in either method appear high in the fused result; documents that rank highly in both appear higher.
The reasons we have stayed with RRF are unglamorous and important. It does not require score calibration between the two methods, which is brittle. It is monotonic — improving rank in either method only helps. It has one tunable parameter. And it is fast enough that we can run it inside the retrieval call without adding meaningful latency.
// Reciprocal rank fusion across semantic and lexical results
function rrf(
semantic: ReadonlyArray<{ id: string; rank: number }>,
lexical: ReadonlyArray<{ id: string; rank: number }>,
k = 60,
): Array<{ id: string; score: number }> {
const acc = new Map<string, number>();
for (const r of semantic) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (k + r.rank));
for (const r of lexical) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (k + r.rank));
return [...acc.entries()]
.map(([id, score]) => ({ id, score }))
.sort((a, b) => b.score - a.score);
}

Per-workflow tuning.
The fusion ratio is workflow-specific. For workflows where the query distribution is heavily exact-match — engineering tickets, regulatory references, named entities — we run with a higher k for the semantic side, which weakens its contribution and lets lexical dominate. For workflows where queries are mostly conceptual — strategic briefs, customer feedback themes, executive summaries — we run with a lower k for the semantic side and let it dominate.
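The per-side k described above can be sketched as a small variant of the RRF function, with a separate damping constant for each signal. This is an illustrative sketch; the result shapes mirror the rrf() example, and the specific constants in the usage line are hypothetical.

```typescript
// RRF with a separate k per signal: a larger kSemantic weakens the
// semantic contribution, letting the lexical ranking dominate the fusion.
function rrfPerSide(
  semantic: ReadonlyArray<{ id: string; rank: number }>,
  lexical: ReadonlyArray<{ id: string; rank: number }>,
  kSemantic = 60,
  kLexical = 60,
): Array<{ id: string; score: number }> {
  const acc = new Map<string, number>();
  for (const r of semantic) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (kSemantic + r.rank));
  for (const r of lexical) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (kLexical + r.rank));
  return [...acc.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: for an exact-match-heavy workflow, kSemantic = 200 and
// kLexical = 20 make a top lexical hit (1/21) outscore a top semantic
// hit (1/201).
rrfPerSide([{ id: "sem", rank: 1 }], [{ id: "lex", rank: 1 }], 200, 20);
```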
We tune k per workflow during the install by running a held-out set of historical queries with editor-marked relevant documents and sweeping k values. The optimal k typically falls between 20 and 100. The sweep takes an hour. The deployment runs better for the next year because of it.
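The sweep can be sketched as follows. The data shapes (per-query semantic and lexical rankings plus an editor-marked relevant set), the recall@10 cutoff, and the candidate grid are assumptions for illustration; the inline fuse helper is a compact restatement of RRF with a tunable semantic-side k.

```typescript
type Ranked = ReadonlyArray<{ id: string; rank: number }>;
type HeldOut = { semantic: Ranked; lexical: Ranked; relevant: Set<string> };

// Compact RRF with a tunable semantic-side k; returns ids best-first.
function fuse(semantic: Ranked, lexical: Ranked, kSem: number, kLex = 60): string[] {
  const acc = new Map<string, number>();
  for (const r of semantic) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (kSem + r.rank));
  for (const r of lexical) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (kLex + r.rank));
  return [...acc.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// Fraction of the editor-marked relevant set found in the top `cutoff`.
function recallAt(fused: string[], relevant: Set<string>, cutoff = 10): number {
  const top = new Set(fused.slice(0, cutoff));
  let hits = 0;
  for (const id of relevant) if (top.has(id)) hits++;
  return hits / relevant.size;
}

// Sweep candidate k values over the held-out queries; keep the best.
function sweepK(queries: HeldOut[], candidates = [20, 40, 60, 80, 100]): number {
  let best = candidates[0];
  let bestRecall = -1;
  for (const k of candidates) {
    const mean =
      queries.reduce((s, q) => s + recallAt(fuse(q.semantic, q.lexical, k), q.relevant), 0) /
      queries.length;
    if (mean > bestRecall) {
      bestRecall = mean;
      best = k;
    }
  }
  return best;
}
```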
When pure semantic actually does win.
There is one workflow category where pure semantic retrieval beats hybrid: cross-language retrieval where the query and the corpus are in different languages, and there is no lexical overlap to recover. The lexical signal contributes nothing useful, and including it adds noise. We turn off the lexical side entirely for these workflows.
Otherwise, hybrid wins everywhere we have measured. The win is not subtle — recall at the editor-marked-relevant set typically improves by twenty to forty percent against pure semantic. The latency cost is negligible because the lexical search runs in parallel.
Production failure modes hybrid eliminates.
Hybrid retrieval has structurally eliminated three failure modes we used to chase as separate incidents, rather than requiring an individual fix for each.
The named-entity miss. A user searches for "the Acme renewal brief"; the semantic retriever returns conceptually-similar briefs from other accounts. With lexical signal in the fusion, the literal token "Acme" is reweighted into the result.
The version-number miss. A user searches for "v3.2 release notes"; semantic retrieval returns notes for v3.0, v2.4, and v3.5. Hybrid retrieval surfaces the exact version because lexical matching does what semantic matching cannot.
The error-message miss. An engineer searches for the exact text of a stack trace; semantic retrieval returns conceptually-related debugging discussions. Hybrid retrieval surfaces the specific incident report that contained that exact stack trace.
How this fits in the broader retrieval architecture.
Hybrid retrieval lives in the embedding layer of the four-layer queryable-memory architecture. The corpus, schema, and access-policy layers are unchanged. What changes is that the embedding layer maintains two indexes — a semantic index and a lexical (BM25) index — and the retrieval call fans out to both, fuses, and returns. This is one of the architectural reasons we keep the corpus and the indexes separate, and one of the reasons our retrieval architecture treats the embedding layer as ephemeral.
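The fan-out-and-fuse shape of the retrieval call can be sketched as below. The searchSemantic and searchLexical parameters stand in for the two index clients and are hypothetical, not a specific library's API; the fusion step is the RRF from earlier.

```typescript
type Hit = { id: string; rank: number };

async function hybridRetrieve(
  query: string,
  searchSemantic: (q: string) => Promise<Hit[]>,
  searchLexical: (q: string) => Promise<Hit[]>,
  k = 60,
): Promise<Array<{ id: string; score: number }>> {
  // Both index lookups run concurrently, so the lexical side adds no
  // meaningful latency over pure semantic retrieval.
  const [semantic, lexical] = await Promise.all([
    searchSemantic(query),
    searchLexical(query),
  ]);
  // Reciprocal rank fusion over the two result lists.
  const acc = new Map<string, number>();
  for (const r of semantic) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (k + r.rank));
  for (const r of lexical) acc.set(r.id, (acc.get(r.id) ?? 0) + 1 / (k + r.rank));
  return [...acc.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```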
If you are running pure semantic retrieval today and have been routing around its weak spots with re-rankers, manual fallbacks, and ad-hoc lexical filters, the hybrid architecture is what those workarounds are pointing at. The right answer is to surface lexical signal as a peer to semantic signal, fuse with RRF, and tune per workflow. The hacks go away. The recall numbers go up. The deployment stops surprising the editor team.