Naive RAG is a confident liar
Point a chatbot at a folder of PDFs, embed the pages, retrieve the top few, and let a model write an answer. It demos beautifully. Then you put it in front of a question that actually matters, and it tells a customer something that is confidently, fluently, plausibly wrong.
For consumer trivia, a wrong answer is a shrug. For a B2B technical question, where the wrong product gets specified into a machine, it is the only failure that counts. The job here was never raw correctness. It was traceability and calibrated abstention: never be silently wrong, and hand off to a human the moment the answer stops being safe to give.
The contrast below runs the same question two ways. Naive RAG answers every time. The verified agent answers only when it can stand behind the answer, and routes the rest to a person.
An agent proposes. A verifier and a human decide.
The fix is a separation of powers. The language model is never the final authority on a decision. It reads the question and it writes the reply, but the decision in between is made against a cited, versioned base of claims, checked by a verifier, and gated before anything reaches the customer.
That makes the model the interchangeable part. The durable assets are the cited knowledge, the verifier, the gates, and an append-only log of every run. Swap the model for a better one next quarter and the guarantees do not move.
Here is the runtime path for a single inquiry, stage by stage. Each stage is doing one job, and you can see exactly where the model is allowed to reason and where the system takes the decision out of its hands.
Parse the inquiry
Free text becomes structured parameters.
A routine model turns the messy question into typed parameters. The rule is never guess: anything it cannot read stays null and surfaces as an ambiguity, which can trigger a clarifying question rather than an invented value.
Naive RAG skips this. The raw question goes straight to embedding.
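A minimal sketch of the "never guess" rule, assuming the parser emits a typed record. The field names here (target_material, sensing_range_mm, throughput_per_hour) are hypothetical and not taken from the real schema; the point is only that unread values stay None and surface as ambiguities.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class InquiryParams:
    # Hypothetical fields. Anything the parser cannot read stays None.
    target_material: Optional[str] = None
    sensing_range_mm: Optional[float] = None
    throughput_per_hour: Optional[int] = None


def ambiguities(params: InquiryParams) -> list[str]:
    """Return the names of parameters the parser could not read."""
    return [f.name for f in fields(params) if getattr(params, f.name) is None]


def clarifying_question(params: InquiryParams) -> Optional[str]:
    """Ask about missing parameters instead of inventing values."""
    missing = ambiguities(params)
    if not missing:
        return None
    return "Could you specify: " + ", ".join(missing) + "?"
```

An inquiry that states material and throughput but no range would come back with one ambiguity, and the system can ask about it rather than fill it in.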
You do not retrieve chunks. You retrieve claims.
This is the inversion that makes the rest work. Naive RAG retrieves raw page text and hopes the model reads it correctly. This system retrieves cited claims: small, reviewed, versioned facts that a human has already signed off, each one pointing at the exact page it came from.
Getting there is an offline pipeline, run once per product family. A datasheet is parsed page by page, because the page is the citation unit the audit trail will link back to. There is no token-window chunker and no overlap. The page is the chunk, on purpose. An extraction model reads each page and emits candidate claims, each tagged with its kind and an extraction confidence. Those claims, not the raw pages, are what gets embedded.
Embedding the distilled claim instead of the page sharpens retrieval and makes it bilingual: the datasheets are German and English, the queries arrive in either, and a multilingual embedding lets an English question hit a German claim. The vectors live in Postgres with pgvector, in their own index, keyed by claim id, kept deliberately separate from the relational claim store so the human-readable source of truth never moves.
At query time, retrieval over-fetches four times the claims it needs, then filters to only the claims a human has frozen, in similarity order, until it has enough. That filter is a correctness gate, not a learned reranker: a claim's status is checked against the live store, so a claim approved and then unfrozen after it was embedded still gets excluded. The honest version of this system has no reranker, no hybrid keyword search, and no magic. It has a clean retrieval rule and a store you can read.
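The retrieval rule can be sketched in a few lines. This is an illustration under assumptions, not the production code: nearest stands in for the vector search and live_status for a lookup against the relational claim store, and both names are hypothetical. Only the factor of four and the "frozen" status come from the description above.

```python
from typing import Callable

OVERFETCH_FACTOR = 4  # from the text: fetch four times what is needed


def retrieve_frozen(
    nearest: Callable[[int], list[str]],
    live_status: Callable[[str], str],
    k: int,
) -> list[str]:
    """Return up to k frozen claim ids, preserving similarity order."""
    candidates = nearest(k * OVERFETCH_FACTOR)  # most similar first
    kept: list[str] = []
    for claim_id in candidates:
        # Status is read from the live store at query time, not from the
        # vector index, so a claim unfrozen after embedding is excluded.
        if live_status(claim_id) == "frozen":
            kept.append(claim_id)
            if len(kept) == k:
                break
    return kept
```

If fewer than k frozen claims survive the filter, the function returns what it has rather than padding the result with unreviewed claims.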
The offline pipeline, run once per product family: a datasheet becomes reviewed, frozen, cited claims.
Each page is extracted as its own unit of text, with pages too short to matter dropped. The page is kept whole because it is the citation target the audit trail links back to.
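A sketch of the page-as-chunk step, assuming the PDF parser has already produced one string per page. The PageUnit type and the min_chars cutoff of 200 are illustrative assumptions; the source only says that pages too short to matter are dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageUnit:
    document_id: str
    page_number: int  # 1-based; this is the citation target
    text: str


def pages_to_units(
    document_id: str, pages: list[str], min_chars: int = 200
) -> list[PageUnit]:
    """Keep each page whole; drop pages too short to carry a claim."""
    units = []
    for number, text in enumerate(pages, start=1):
        cleaned = text.strip()
        if len(cleaned) < min_chars:
            continue  # skipped pages do not renumber the pages that remain
        units.append(PageUnit(document_id, number, cleaned))
    return units
```

Page numbers are assigned before filtering, so a citation still points at the page as it appears in the original datasheet.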
Knowledge lives in a store you can read, not in weights
Every claim is one row you can open, cite, correct, and version. That is the whole argument against the tempting shortcut of fine-tuning a model on your documents: weights are a black box. You cannot read one fact out of them, cite it, fix it in isolation, or roll it back. You bury the knowledge instead of owning it.
The field that does the heavy lifting is the claim's kind. A constraint is a wall the verifier cannot let positives outvote. A fact can be weighed. A redirect steers to a variant. A claim marked as needing an expert is held out of the live store entirely, so the system flags rather than guesses. Status decides what retrieval can even see: only frozen claims are live.
The kind field is what the verifier keys off. A constraint is a wall positives cannot outvote. Other kinds: fact, redirect, needs_expert.
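To make "one row you can open" concrete, here is a hypothetical shape for a claim record. The four kinds and the "frozen" status come from the text; every other field name is an assumption made for illustration, not the real schema.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(str, Enum):
    CONSTRAINT = "constraint"      # a wall positives cannot outvote
    FACT = "fact"                  # can be weighed
    REDIRECT = "redirect"          # steers to a variant
    NEEDS_EXPERT = "needs_expert"  # held out of the live store


@dataclass(frozen=True)
class Claim:
    claim_id: str
    kind: Kind
    text: str
    source_document: str   # assumed field name
    source_page: int       # the page the citation links back to
    extraction_confidence: float
    status: str            # "frozen" is from the text; other values are assumed
    version: int = 1


def is_live(claim: Claim) -> bool:
    """Only frozen claims are live; needs_expert claims never are."""
    return claim.status == "frozen" and claim.kind is not Kind.NEEDS_EXPERT
```

Because each claim is a plain record, correcting a fact means editing or versioning one row instead of retraining anything.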
The judge is a model. The facts are not left to it.
After the agent proposes a cited recommendation, a separate critic model checks it: are the citations real, is the answer grounded, were the hard constraints respected? It is told to judge conservatively, because a confident wrong answer is the only true failure and an escalation is an acceptable one.
But the two signals that decide safety are recomputed in code and override the model. Do the cited claim ids actually resolve against the frozen store? That is a set-membership check, not an opinion. Was a load-bearing number computed beyond what any datasheet states? The tool that does the computing flags it at the source. The model adds judgment on top, but it cannot vote those two facts away. Factual signals are decided deterministically, not left to the model.
An example: the critic model says the answer is grounded and the citation is valid. Code then re-resolves every cited id against the frozen store. One id does not exist, so citationsValid is forced to false. A set-membership check, not an opinion.
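The override can be sketched as a pure function. The Verdict type and the enforce_citations name are hypothetical; citations_valid mirrors the citationsValid flag mentioned above. Treating an answer with no citations at all as invalid is an assumption of this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Verdict:
    grounded: bool
    citations_valid: bool  # mirrors citationsValid; the model's opinion on input


def enforce_citations(
    verdict: Verdict, cited_ids: list[str], frozen_ids: set[str]
) -> Verdict:
    """Recompute citation validity in code and overwrite the model's claim."""
    unresolved = [cid for cid in cited_ids if cid not in frozen_ids]
    # Assumption: an answer that cites nothing is also invalid.
    valid = bool(cited_ids) and not unresolved
    return replace(verdict, citations_valid=valid)
```

Whatever the critic model reported, the returned verdict carries the value computed from the store.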
Three confidences, one gate, hard walls
Most systems collapse confidence into a single number, which is exactly how a shaky answer hides behind a strong-looking score. This keeps three, and never blends them. Extraction confidence asks whether the claim was actually in the document or interpreted. Grounding confidence asks whether the agent really retrieved support for what it says. Answer confidence is what is owed to the customer, and it has to fall when a deciding claim was interpreted or a value had to be computed.
The gate that reads those numbers is a plain function, not a model. Two thresholds carry the weight. A computed value that is not datasheet-backed force-escalates unless answer confidence is very high. A low-confidence answer routes to a human regardless. And constraints are walls, not penalties: enough small positives can never outvote one physical impossibility. That is the bug this design exists to kill, the wrong answer wearing a confidence number.
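A sketch of what such a plain-function gate might look like. The mapping of 0.60 to the low-confidence floor and 0.85 to "very high" is an assumption, as are the Signals field names and the two outcome labels. How extraction and grounding confidence feed the real gate is not specified, so here they are carried alongside but never blended into the answer score.

```python
from dataclasses import dataclass

LOW_ANSWER = 0.60        # assumed: below this, always route to a human
VERY_HIGH_ANSWER = 0.85  # assumed: computed values need at least this


@dataclass(frozen=True)
class Signals:
    extraction: float  # kept separate, never blended
    grounding: float   # kept separate, never blended
    answer: float
    citations_valid: bool
    computed_beyond_datasheet: bool
    constraint_violated: bool


def gate(s: Signals) -> str:
    """Return "answer" or "escalate"; hard walls are checked before scores."""
    if s.constraint_violated:
        return "escalate"  # a wall: no confidence score can outvote it
    if not s.citations_valid:
        return "escalate"
    if s.answer < LOW_ANSWER:
        return "escalate"
    if s.computed_beyond_datasheet and s.answer < VERY_HIGH_ANSWER:
        return "escalate"
    return "answer"
```

The ordering matters: the constraint and citation checks run first, so a perfect confidence score cannot rescue a proposal that violates a wall.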
The interactive demo routes four inquiries through the same machinery, each breaking naive RAG in a different way. One of them:
Detecting clear glass bottles on a line at 30 to 60 thousand per hour. Which sensor?
(In the confidence display, the ticks on answer confidence mark the 0.60 and 0.85 gate thresholds.)
A frozen constraint says retro-reflective sensing fails on transparent targets. That is a wall, so the agent redirects to the clear-object variant with high confidence, because the deciding claim was a cited constraint.
Naive RAG recommends a standard sensor that cannot see glass.
Every run is replayable with no model at all
Every step of every inquiry (intake, retrieval, tool call, proposal, verdict, gate decision, draft) is appended to a log. The customer-facing prose is rendered from that log, never the other way around. The decision exists first as structured events; the language is the last, thinnest layer on top.
Because the record is complete, a finished inquiry can be reconstructed end to end, including its decision and all three confidences, with zero model calls. That is the audit guarantee, and it doubles as resilience: when the model is unavailable, the system can still serve a faithful replay of a prior run instead of failing in front of a customer.
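A minimal in-memory sketch of replay from an append-only log. The real log lives in Postgres; the Event and EventLog types here are hypothetical and only illustrate that reconstruction reads events and calls no model.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    inquiry_id: str
    step: str  # e.g. "intake", "retrieval", "verdict", "gate", "draft"
    payload: dict[str, Any]


class EventLog:
    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        """Events are only ever added, never updated or removed."""
        self._events.append(event)

    def replay(self, inquiry_id: str) -> dict[str, Any]:
        """Rebuild one inquiry's state from its events, with no model calls."""
        state: dict[str, Any] = {}
        for event in self._events:
            if event.inquiry_id == inquiry_id:
                # Simplification: the last event for a given step wins.
                state[event.step] = event.payload
        return state
```

The customer-facing text would then be rendered from the replayed state, which is why a prior run can still be served when the model is unavailable.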
Where the knowledge compounds
The asset that grows is the claim store. The most valuable knowledge in this domain is not in any datasheet; it is in the heads of senior engineers who are retiring. So the live loop is expert authoring: capture that tacit knowledge as a first-class cited claim, freeze it, embed it, and the runtime agent retrieves it on the very next inquiry. Knowledge moves from a person into an inspectable store, not into model weights.
The schema goes one step further than the live loop currently uses: claims carry full version history, with origin, the version they supersede, and a diff, so a correction can fork a new version rather than overwrite the old one. That is the next increment, and it is designed in rather than bolted on. I am telling you what is wired today and what is scaffolded for tomorrow, because the whole point of the system is to not overstate what it knows.
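Since this part is scaffolded rather than wired, the following is only a sketch of how a correction could fork a version. The ClaimVersion type and fork function are hypothetical; origin, supersedes, and diff follow the three pieces of history named above, and the diff format (Python's difflib unified diff) is an arbitrary choice for illustration.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimVersion:
    claim_id: str
    version: int
    text: str
    origin: str                       # where this version came from
    supersedes: Optional[int] = None  # the version this one replaces
    diff: str = ""


def fork(previous: ClaimVersion, new_text: str, origin: str) -> ClaimVersion:
    """Create a new version; the previous record is left untouched."""
    diff = "\n".join(
        difflib.unified_diff([previous.text], [new_text], lineterm="", n=0)
    )
    return ClaimVersion(
        claim_id=previous.claim_id,
        version=previous.version + 1,
        text=new_text,
        origin=origin,
        supersedes=previous.version,
        diff=diff,
    )
```

Because the records are immutable, rolling back is a matter of pointing at the superseded version again rather than reconstructing lost text.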
What it is built on
Nothing exotic, chosen so the parts that matter stay swappable. The orchestration framework runs the agents, tools, and the durable ingestion workflow with its human-review suspend step. Postgres holds the claim store, the vector index, and the event log in one place you can open and inspect. Models sit behind a single config module, so the heavy reasoning model and the routine one are each one line to change. A fully offline mode is designed for but not yet wired.
Traceability and calibrated abstention
The promise was never that the model would not hallucinate. You cannot fully prevent that in any probabilistic system. The promise is that it cannot hallucinate its way into a customer-facing answer unchecked. The success metric is correct deflection at a near-zero wrong rate, not raw deflection, and certainly not raw answer volume.
That is the architecture I reach for whenever being wrong is expensive: an agent proposes, a verifier and a human decide, and every decision leaves a trail you can replay.